Build Live Voice Agents, Translators, and Transcription Tools with OpenAI GPT Realtime
OpenAI shipped its production speech-to-speech model, gpt-realtime, with general availability in August 2025. This guide shows developers how to build three real things with it: a live voice agent, a real-time speech translator, and a live transcription tool — using WebRTC, WebSocket, and the Agents SDK.
What You'll Learn
- ✓ Getting Started: How the Realtime API Actually Works
- ✓ Core Concepts: Sessions, Events, and Voice Activity Detection
- ✓ Use Case 1: Building a Live Voice Agent
- ✓ Use Case 2: Building a Real-Time Speech Translator
- ✓ Use Case 3: Building a Live Transcription Tool
- ✓ Production Concerns: Cost, Context, and Session Management
Guide Curriculum
Foundation
- Getting Started: How the Realtime API Actually Works
- Core Concepts: Sessions, Events, and Voice Activity Detection
Implementation
- Use Case 1: Building a Live Voice Agent
- Use Case 2: Building a Real-Time Speech Translator
- Use Case 3: Building a Live Transcription Tool
Mastery
- Production Concerns: Cost, Context, and Session Management
- Advanced Configuration: Prompts, Sideband Connections, and EU Data Residency
- Next Steps: What to Build Next
About the Author
Hiram Clark is the founder of vybecoding.ai and editor of every guide and news article published on the site. He reviews all AI-drafted content for accuracy before publication and is personally accountable for factual errors. He works hands-on with the AI development tools, workflows, and infrastructure covered here.
Module 1 Foundation
1.1 Getting Started: How the Realtime API Actually Works
The Realtime API is not a standard request-response API. You do not send a complete audio file and get a complete transcript back. Instead, you open a persistent, bidirectional connection to OpenAI's servers and stream audio in one direction while streaming audio (or text) back in the other — in real time, as the conversation unfolds.
There are three connection methods, each suited to a different deployment context:
WebRTC is the browser-native approach. Your web page captures microphone audio via getUserMedia(), establishes a WebRTC peer connection to OpenAI, and receives the model's audio output directly in the browser with sub-200ms latency. This is the right choice for web apps where the browser talks directly to the API. Because a direct browser connection would otherwise expose your API key client-side, you use short-lived ephemeral keys instead: your server mints a single-use key and passes it to the browser, so the browser never touches your main API key.
WebSocket is the application-server approach. Your server opens a WebSocket to wss://api.openai.com/v1/realtime and streams audio over that connection. This is the right choice for server-side pipelines, mobile backends, or any context where you want to keep business logic and tool calls server-side and out of the browser.
SIP is the telephony approach. If you are routing phone calls through the Realtime API, OpenAI supports Session Initiation Protocol connections, enabling call-center automation, IVR replacement, and phone-based voice agents. Audio must be converted to 16-bit PCM, 24 kHz mono before being forwarded.
All three methods share the same core model (gpt-realtime) and the same event-driven session model. You configure the session by sending session.update events, receive transcript and audio delta events as the model processes input, and respond to function-call events when the model invokes tools.
1.2 Core Concepts: Sessions, Events, and Voice Activity Detection
The session is the fundamental unit of the Realtime API. When you connect, you create or update a session that defines how the model behaves for the life of that connection. Sessions can now last up to 60 minutes (up from 30 minutes in the beta). The gpt-realtime model has a 32,768-token context window; responses consume up to 4,096 tokens, leaving 28,672 tokens of usable input per turn.
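The budget arithmetic above can be written down as a small helper. This is a sketch only: the constants are the figures quoted in this section, and the function names (usable_input_tokens, fits_in_context) are hypothetical, not part of any OpenAI SDK.

```python
CONTEXT_WINDOW_TOKENS = 32_768  # context window quoted in this section
MAX_RESPONSE_TOKENS = 4_096     # maximum response size quoted in this section


def usable_input_tokens(window: int = CONTEXT_WINDOW_TOKENS,
                        max_response: int = MAX_RESPONSE_TOKENS) -> int:
    """Tokens left for input once the largest possible response is reserved."""
    return window - max_response


def fits_in_context(used_tokens: int, new_tokens: int) -> bool:
    """True if adding new_tokens keeps the input within the usable budget."""
    return used_tokens + new_tokens <= usable_input_tokens()


if __name__ == "__main__":
    print(usable_input_tokens())         # 28672
    print(fits_in_context(28_000, 500))  # True
    print(fits_in_context(28_000, 700))  # False
```

A check like this is useful before injecting large text items (for example, retrieved documents) into a long-running session.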
The session is configured via session.update events — JSON messages you send over the WebSocket or WebRTC data channel. A minimal session configuration looks like this:
```javascript
// Assumes `ws` is an already-open WebSocket connection to the Realtime API.
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    model: "gpt-realtime",
    turn_detection: { type: "server_vad" },
    modalities: ["text", "audio"]
  }
}));
```
Turn detection is how the model knows when you have finished speaking. server_vad (server-side Voice Activity Detection) is the standard choice — the API detects pauses in the audio stream and triggers a model response automatically. You do not need to send an explicit "the user stopped talking" event; the server handles it. For use cases like push-to-talk interfaces or batch audio processing, you can disable VAD and commit audio buffers manually.
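For the manual path, a push-to-talk client has to do by hand what server VAD does automatically. The sketch below only builds the JSON messages; it does not open a connection. It assumes the client event names input_audio_buffer.append, input_audio_buffer.commit, and response.create, and that turn detection is switched off by sending null. None of these appear in this guide, so check them against the current API reference. The helper names are hypothetical.

```python
import base64
import json


def disable_vad_message() -> str:
    """session.update that turns off automatic turn detection (assumed shape)."""
    return json.dumps({
        "type": "session.update",
        "session": {
            "type": "realtime",
            "audio": {"input": {"turn_detection": None}},
        },
    })


def push_to_talk_messages(pcm_chunks: list[bytes]) -> list[str]:
    """Messages for one turn: append each chunk, commit the buffer, ask for a reply."""
    messages = [
        json.dumps({
            "type": "input_audio_buffer.append",            # assumed event name
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
        for chunk in pcm_chunks
    ]
    messages.append(json.dumps({"type": "input_audio_buffer.commit"}))  # assumed
    messages.append(json.dumps({"type": "response.create"}))            # assumed
    return messages


if __name__ == "__main__":
    # Two fake 16-bit PCM chunks stand in for microphone audio.
    for message in push_to_talk_messages([b"\x00\x01" * 4, b"\x02\x03" * 4]):
        print(message)
```

In a real client, the append messages would be sent while the talk button is held, and the commit and response messages when it is released.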
Voices. The GA release includes 10 voices total: 8 existing voices plus two new ones — cedar and marin — exclusive to the Realtime API. OpenAI recommends marin and cedar for best assistant voice quality.
Modalities. Sessions support ["text", "audio"] together or ["text"] alone. For voice agents you want both. For transcription-only use cases, you run a transcription-type session (covered in section 2.3), which omits the spoken response entirely.
Idle timeouts. New with GA: you can configure the model to speak up when the user goes silent, rather than waiting indefinitely. Setting idle_timeout_ms: 6000 causes the API to fire an input_audio_buffer.timeout_triggered event after 6 seconds of silence post-response, prompting the model to check in with the user. This is essential for phone-call-style agents where a dead line is ambiguous.
Module 2 Implementation
2.1 Use Case 1: Building a Live Voice Agent
A live voice agent is an AI that listens to the user, reasons about what they said, and speaks back — all in real time, with no perceptible gap. The canonical tool for this is the OpenAI Agents SDK, which wraps the Realtime API WebSocket transport and handles session lifecycle, event routing, and tool-call management for you.
Install the SDK with voice support:
```shell
pip install 'openai-agents[voice]'
```
A minimal voice agent in Python:
```python
import asyncio

from agents.realtime import RealtimeAgent, RealtimeRunner

agent = RealtimeAgent(
    name="Assistant",
    instructions="You are a helpful voice assistant. Keep responses short and conversational.",
)

runner = RealtimeRunner(
    starting_agent=agent,
    config={
        "model_settings": {
            "model_name": "gpt-realtime-2",
            "audio": {
                "input": {
                    "format": "pcm16",
                    "transcription": {"model": "gpt-4o-mini-transcribe"},
                    "turn_detection": {
                        "type": "semantic_vad",
                        "interrupt_response": True,
                    },
                },
                "output": {
                    "format": "pcm16",
                    "voice": "marin",
                },
            },
        }
    },
)


async def main() -> None:
    session = await runner.run()
    async with session:
        await session.send_message("Say hello in one short sentence.")
        async for event in session:
            if event.type == "audio":
                # Play or forward event.audio.data to the user's speaker
                pass
            elif event.type == "history_added":
                print(event.item)
            elif event.type == "agent_end":
                break
            elif event.type == "error":
                print(f"Error: {event.error}")


if __name__ == "__main__":
    asyncio.run(main())
```
A note on model naming: the prose and the WebSocket session.update examples above use gpt-realtime — that is how OpenAI's announcement and developer docs refer to the model. The Agents SDK code uses gpt-realtime-2 as the model identifier. These point to the same production speech-to-speech model; the -2 is the API-level generation suffix you pass programmatically (confirmed on OpenAI's pricing page and the Agents SDK quickstart). When you wire this up yourself, use gpt-realtime-2 as the model name in SDK config and gpt-realtime in the raw Realtime API session config — both resolve to the current GA model.
This example, drawn from the OpenAI Agents SDK quickstart, sets up:
- semantic_vad for turn detection: this is more accurate than pure audio-energy VAD because it uses the model's understanding of sentence completion, not just silence detection.
- interrupt_response set to True: if the user starts speaking while the model is still talking, the model stops and listens. Essential for natural-feeling conversations.
- gpt-4o-mini-transcribe for generating a text transcript of the user's audio alongside the audio processing.
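The audio branch in the example is left as a placeholder. One way to fill it during development is to collect the chunks and write them to a WAV file for inspection. This sketch assumes the chunks are raw bytes of 16-bit mono PCM at 24 kHz (the pcm16 format configured above, at the sample rate this guide quotes elsewhere); the function name and file path are hypothetical.

```python
import wave

SAMPLE_RATE_HZ = 24_000  # assumed sample rate for pcm16 output
SAMPLE_WIDTH_BYTES = 2   # 16-bit samples
CHANNELS = 1             # mono


def write_wav(path: str, pcm_chunks: list[bytes]) -> float:
    """Write raw PCM chunks to a WAV file and return its duration in seconds."""
    pcm = b"".join(pcm_chunks)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(CHANNELS)
        wav_file.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav_file.setframerate(SAMPLE_RATE_HZ)
        wav_file.writeframes(pcm)
    return len(pcm) / (SAMPLE_RATE_HZ * SAMPLE_WIDTH_BYTES * CHANNELS)
```

In the event loop, you would append event.audio.data to a list inside the audio branch and call write_wav once the loop exits. A production client would stream the chunks to an audio device instead of buffering them.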
For production, add tool definitions to the agent to let it call APIs, look up data, or trigger actions on behalf of the user. Async function calling in the GA model allows the conversation to continue naturally while a tool call is pending — the model says something like "I'm looking that up now" rather than going silent.
2.2 Use Case 2: Building a Real-Time Speech Translator
A real-time speech translator listens to a speaker in one language and streams spoken output in another language — live, with no stop-and-start. This is where gpt-realtime's speech-to-speech architecture shines: the model can switch output language mid-stream without a separate translation step.
The session configuration is the same as a voice agent, with one critical instruction in the system prompt: tell the model exactly what to do. OpenAI's developer notes on the GA release emphasize that the new model's instruction-following is much more precise than the beta — "a prompt that said, 'Always say X when Y,' may have been treated by the old model as vague guidance, whereas the new model may adhere to it in unexpected situations." This precision is exactly what a translator needs.
Session configuration for a Spanish-to-English translator:
```javascript
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    type: "realtime",
    instructions: "You are a simultaneous interpreter. When the user speaks in Spanish, immediately translate what they said into English and speak the translation aloud. Do not add commentary. Translate only.",
    audio: {
      input: {
        turn_detection: { type: "server_vad" }
      },
      output: {
        voice: "cedar"
      }
    },
    modalities: ["text", "audio"]
  }
}));
```
For bidirectional translation (where either party may speak either language), extend the instructions to handle both directions: "If the user speaks in Spanish, translate to English. If the user speaks in English, translate to Spanish." The model's language detection is part of its audio reasoning and operates without a separate classifier step.
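The one-way and two-way prompts differ only in their translation rules, so they can be generated from a language pair. The sketch below builds the instruction string and a session.update payload that mirrors the configuration shown in this section. The function names are hypothetical, and the prompt wording is one possible phrasing, not a tested recommendation.

```python
import json


def interpreter_instructions(lang_a: str, lang_b: str, bidirectional: bool = False) -> str:
    """Build the system prompt for a one-way or two-way interpreter."""
    rules = [f"If the user speaks in {lang_a}, translate to {lang_b}."]
    if bidirectional:
        rules.append(f"If the user speaks in {lang_b}, translate to {lang_a}.")
    return (
        "You are a simultaneous interpreter. "
        + " ".join(rules)
        + " Speak the translation aloud. Do not add commentary. Translate only."
    )


def translator_session_update(lang_a: str, lang_b: str,
                              bidirectional: bool = False,
                              voice: str = "cedar") -> str:
    """Serialize a session.update message shaped like the example in this section."""
    return json.dumps({
        "type": "session.update",
        "session": {
            "type": "realtime",
            "instructions": interpreter_instructions(lang_a, lang_b, bidirectional),
            "audio": {
                "input": {"turn_detection": {"type": "server_vad"}},
                "output": {"voice": voice},
            },
            "modalities": ["text", "audio"],
        },
    })


if __name__ == "__main__":
    print(interpreter_instructions("Spanish", "English", bidirectional=True))
```

Generating the prompt from parameters keeps the two directions symmetrical, which matters when the model follows instructions as literally as the developer notes describe.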
One important caveat to verify before deploying: the GA model documentation notes that multilingual robustness decreases after multiple conversation turns, particularly for accented speech. For professional-grade translation products, pilot with representative speakers from your target language region and test session lengths matching real use.
For a browser-based translator using WebRTC, where two parties in different locations connect, you will want the sideband connection architecture: the client browser holds the audio connection, and your application server holds a second connection to the same session to monitor and log the transcript without the audio latency of routing through a server.
2.3 Use Case 3: Building a Live Transcription Tool
Live transcription is simpler than a full voice agent because you do not need a spoken response. The Realtime API supports a dedicated transcription session type that streams transcript deltas as audio arrives — users see words appearing on screen before they have finished their sentence.
This is optimized differently from the voice-agent path. You use gpt-realtime-whisper as the transcription model — it is natively streaming and designed for this real-time delta use case, unlike gpt-4o-transcribe which is better for batch accuracy.
```javascript
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    type: "transcription",
    audio: {
      input: {
        format: { type: "audio/pcm", rate: 24000 },
        transcription: {
          model: "gpt-realtime-whisper",
          language: "en"
        },
        turn_detection: {
          type: "server_vad",
          threshold: 0.5
        }
      }
    }
  }
}));
```
Handling transcript events:
```javascript
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === "conversation.item.input_audio_transcription.delta") {
    // Incremental transcript fragment — append to the display
    appendToCaption(data.delta);
  } else if (data.type === "conversation.item.input_audio_transcription.completed") {
    // Final, corrected transcript for the turn
    finalizeCaption(data.transcript);
  }
};
```
The delta events arrive continuously as speech is recognized. The completed event arrives when the speaker pauses and gives you the final, corrected text for that utterance. A good live-transcription UI uses delta events to show "in progress" text in real time, then replaces it with the completed event output.
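The replace-on-completion pattern can be sketched as a small state holder. The event types and the delta and transcript fields come from the handler above. Keying the buffer on an item_id field is an assumption about the event payload, and the class itself is hypothetical. Utterances are rendered in the order they were finalized, which is a simplification.

```python
class CaptionBuffer:
    """Tracks in-progress and finalized transcript text per utterance."""

    def __init__(self) -> None:
        self.in_progress: dict[str, str] = {}
        self.final: dict[str, str] = {}

    def handle(self, event: dict) -> None:
        item_id = event.get("item_id", "")  # assumed field identifying the utterance
        kind = event.get("type")
        if kind == "conversation.item.input_audio_transcription.delta":
            # Append the fragment to the provisional text for this utterance.
            self.in_progress[item_id] = self.in_progress.get(item_id, "") + event["delta"]
        elif kind == "conversation.item.input_audio_transcription.completed":
            # Replace the provisional text with the final, corrected transcript.
            self.in_progress.pop(item_id, None)
            self.final[item_id] = event["transcript"]

    def render(self) -> str:
        """Finalized text first, then provisional text in square brackets."""
        parts = list(self.final.values())
        parts += [f"[{text}]" for text in self.in_progress.values()]
        return " ".join(parts)
```

A UI would call render after every event and style the bracketed portion differently (for example, in grey) so users can tell provisional text from final text.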
The language parameter is optional — if omitted, the model auto-detects language. Setting it explicitly improves accuracy for known-language scenarios and reduces the chance of misidentification, especially for short utterances. Available transcription models include gpt-realtime-whisper, gpt-4o-transcribe, and gpt-4o-mini-transcribe for cost-accuracy tradeoffs.
Module 3 Mastery
3.1 Production Concerns: Cost, Context, and Session Management
The Realtime API charges per audio token, not per minute. Audio input costs $32 per million tokens; audio output costs $64 per million tokens. Cached input tokens cost $0.40 per million. At these rates, a 10-minute voice session consuming moderate audio in and out will run roughly in the range of a few cents — but the math depends heavily on how much the model talks versus listens. Output tokens are twice the cost of input tokens, so agents that talk a lot cost more than agents that mostly listen and transcribe.
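Because billing is per token, a session's cost is a weighted sum of its token counts. The sketch below applies the three rates quoted above. The token counts in the example are made-up illustrative values, not measurements, and the function name is hypothetical; substitute the usage figures your own sessions report.

```python
AUDIO_INPUT_USD_PER_M = 32.00   # rate quoted in this section
AUDIO_OUTPUT_USD_PER_M = 64.00  # rate quoted in this section
CACHED_INPUT_USD_PER_M = 0.40   # rate quoted in this section


def session_cost_usd(input_tokens: int, output_tokens: int,
                     cached_input_tokens: int = 0) -> float:
    """Estimate audio cost; input_tokens counts uncached input only."""
    total = (
        input_tokens * AUDIO_INPUT_USD_PER_M
        + output_tokens * AUDIO_OUTPUT_USD_PER_M
        + cached_input_tokens * CACHED_INPUT_USD_PER_M
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical session: the same spend on listening and on speaking.
    cost = session_cost_usd(input_tokens=1_000, output_tokens=500,
                            cached_input_tokens=2_000)
    print(f"${cost:.4f}")
```

With these example counts, 500 output tokens cost as much as 1,000 input tokens, which is the two-to-one ratio described above, and the 2,000 cached tokens add less than a tenth of a cent.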
Managing costs for long sessions. The context window fills up over a long conversation. When it hits the 28,672-token input ceiling, the API begins truncating (dropping) the oldest messages automatically. This is useful for session continuity but disrupts prompt caching: dropping old messages changes the context prefix, which busts the cache and means you pay full price for those tokens on the next turn.
The mitigation: configure truncation to remove a bigger chunk less often, rather than a small chunk every turn. Set retention_ratio: 0.8 to truncate 20% of the context window in one go when truncation occurs. This keeps the cache-eligible prefix stable for longer:
```javascript
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    truncation: {
      type: "retention_ratio",
      retention_ratio: 0.8
    }
  }
}));
```
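To see why a lower retention ratio means fewer cache busts, it helps to count truncation events in a toy model. The simulation below is a deliberate simplification, not a description of the API's internals: it assumes every turn adds a fixed number of tokens and that each truncation cuts the context down to retention_ratio times the ceiling. The function name and the per-turn token figure are hypothetical.

```python
INPUT_CEILING_TOKENS = 28_672  # usable input budget quoted in this guide


def count_truncations(turns: int, tokens_per_turn: int, retention_ratio: float,
                      ceiling: int = INPUT_CEILING_TOKENS) -> int:
    """Count how many turns trigger a truncation in the simplified model."""
    context = 0
    truncations = 0
    for _ in range(turns):
        context += tokens_per_turn
        if context > ceiling:
            # Drop the oldest tokens until only retention_ratio of the ceiling remains.
            context = min(context, int(ceiling * retention_ratio))
            truncations += 1
    return truncations


if __name__ == "__main__":
    for ratio in (1.0, 0.8):
        events = count_truncations(turns=200, tokens_per_turn=500, retention_ratio=ratio)
        print(f"retention_ratio={ratio}: {events} truncations in 200 turns")
```

With these made-up numbers, a ratio of 1.0 truncates on every turn once the ceiling is reached, while 0.8 truncates about once every dozen turns. Each truncation changes the context prefix, so fewer truncations means more turns that can reuse the cached prefix.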
Session limits: maximum 60 minutes per session. For applications that need longer continuity (e.g., all-day transcription), you will need to manage session reconnection and carry over any context state manually.
3.2 Advanced Configuration: Prompts, Sideband Connections, and EU Data Residency
Hosted prompts. A session can reference a stored prompt by ID and version and fill in its variables at runtime:
```javascript
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    type: "realtime",
    prompt: {
      id: "pmpt_abc123",
      version: "12",
      variables: {
        user_name: "Natalie",
        language: "Spanish"
      }
    }
  }
}));
```
Any session fields you pass alongside the prompt override the prompt's stored values, giving you runtime flexibility without losing prompt version control.
Sideband connections. For production use cases where tool calls, business logic, or compliance logging must stay server-side, the sideband architecture creates two simultaneous connections to the same session: the client browser or phone endpoint holds the audio connection (keeping latency low), while your application server holds a second control connection to monitor the session, inject instructions, and respond to tool calls. Neither connection sees the other's private credentials.
EU data residency. If you are building for GDPR-sensitive contexts, the gpt-realtime-2025-08-28 model supports EU data residency. You must explicitly enable this at the organization level and route requests through https://eu.api.openai.com instead of the standard endpoint.
3.3 Next Steps: What to Build Next
A practical progression for developers getting started:
- Start with the Realtime Playground. OpenAI's browser-based playground lets you experiment with session configuration, voice selection, and prompts without writing any code. Use it to validate your prompt and voice choice before building.
- Run the WebRTC quickstart. The webrtcHacks single-file demo is a working WebRTC + gpt-realtime connection in vanilla JavaScript with no build step. It is the best way to understand what the raw events look like before adding framework abstractions.
- Build the transcription tool first. It is the simplest Realtime API use case — no output audio to manage, clear success criteria (the transcript is right or wrong), and a meaningful standalone product. Get that working end-to-end before adding the complexity of a full voice agent.
- Add tools to the voice agent. Async function calling in the GA model means the conversation keeps flowing while a tool is pending. Connect a search tool, a calendar API, or a database lookup and watch the agent handle multi-step requests without going silent.
- Instrument your sessions. Log every session event to a structured store: turns, token counts, tool calls, errors. The developer console traces added with GA give you debugging visibility, but your own logs give you cost attribution per session type — essential once you have multiple agents in production.
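For the instrumentation step, a minimal sketch of a structured event log follows. It writes one JSON object per line and keeps a per-type tally. The class name, the session label, and the record layout (ts, session, event) are all hypothetical choices; a production system would likely also pull token usage out of the events that report it.

```python
import io
import json
import time
from collections import Counter
from typing import TextIO


class SessionEventLog:
    """Appends Realtime events to a JSON Lines stream and tallies them by type."""

    def __init__(self, session_label: str, sink: TextIO) -> None:
        self.session_label = session_label
        self.sink = sink
        self.counts = Counter()

    def record(self, event: dict) -> None:
        event_type = event.get("type", "unknown")
        self.counts[event_type] += 1
        line = {"ts": time.time(), "session": self.session_label, "event": event}
        self.sink.write(json.dumps(line) + "\n")


if __name__ == "__main__":
    buffer = io.StringIO()  # stand-in for a file or log shipper
    log = SessionEventLog("transcription-demo", buffer)
    log.record({"type": "session.update"})
    log.record({"type": "error", "error": {"message": "example"}})
    print(dict(log.counts))
    print(buffer.getvalue(), end="")
```

Tagging every line with a session label is what makes cost attribution per session type possible later: the log can be grouped by label without re-parsing the events themselves.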
The Realtime API's GA release with gpt-realtime represents the point where voice-native AI applications became genuinely viable to ship rather than to prototype. The architecture has stabilized, the pricing is predictable, and the tooling — the Agents SDK, hosted prompts, sideband connections — exists to take products past proof-of-concept.
Primary sources used in this guide:
- OpenAI, "Introducing gpt-realtime and Realtime API updates for production voice agents" (August 28, 2025): GA announcement, pricing, model benchmarks, new voices.
- OpenAI Developers Blog, "Developer notes on the Realtime API": session limits, feature matrix, async function calling, idle timeouts, truncation config, sideband connections, EU data residency.
- OpenAI Agents SDK, Realtime quickstart: RealtimeAgent and RealtimeRunner code example.
- OpenAI Developers, Realtime transcription guide: transcription session type, gpt-realtime-whisper, delta events.
- Fora Soft / Medium, "Integrating OpenAI Realtime API with WebRTC, SIP, and WebSockets": session.update patterns, connection method comparison, sub-200ms latency figure.
- webrtcHacks, gpt-realtime-webrtc single-file demo: minimal WebRTC reference implementation.