Build a production voice agent in Python that handles live audio, barge-in mid-sentence, function calls, phone-call attach over SIP, and live speech translation. This book ships working Python code on OpenAI's GA Realtime API across all three new models (gpt-realtime-2 for speech-to-speech with GPT-5-class reasoning and a 128K context window, gpt-realtime-translate for live translation across 70 input and 13 output languages, and gpt-realtime-whisper for streaming transcription) and all three transports (WebSocket, SIP, browser WebRTC) on the openai-agents SDK.

You build a real voice agent end to end. Capture live PCM16 audio from a microphone, route it into a gpt-realtime-2 WebSocket session, play model audio back without blocking the asyncio event loop, and handle interruption with conversation truncation. Wire up function calling with voice-safe input validation and per-tool authorization. Add streaming transcription on the same session. Build a live-translation pipeline on the dedicated translate endpoint with the new marin voice. Attach inbound phone calls over SIP. Mint ephemeral keys for a browser WebRTC client and run Python sideband control. Then ship: latency budgets, cost modeling, evaluation harnesses, blue-green deploys, and reconnection policy.
What you build:

- A reusable RealtimeRunner factory with frozen SessionConfig, validated against the 10 GA voices and the 3 GA model IDs
- A live audio pipeline that bridges sounddevice background threads and asyncio with bounded queues and explicit thread-to-async handoffs
- Manual turn control, server VAD, semantic VAD, and barge-in handling with response.cancel + conversation.item.truncate aligned to what the user actually heard
- Function calling with a ToolRegistry, voice-safe argument coercion, redaction of spoken PII, and per-tool authorization tied to the caller context
- A live-translation service with separate source-transcript, target-text, and target-audio streams, pace preservation, and the marin voice for translation playback
- SIP attach with reconnect policy, telephony-grade latency budgets, and a server that stays out of the media path
- Browser WebRTC with ephemeral key minting at /v1/realtime/client_secrets, sideband control over a separate Python channel, and correlation-id-based state sharing
- Observability and evals: latency budgets, cost estimation at the per-session level, transcript-quality scoring, and CI-failing regression detection
- Deployment: tenant-scoped sessions, blue-green rotation, reconnection policy, and an end-to-end multilingual phone-line case study

What makes this book different:

- GA-first, zero beta drag. The beta endpoint and gpt-4o-realtime-preview* models were removed May 7, 2026. This book targets only the GA surface with the correct model IDs and endpoint paths.
- All three GA models in one book. Most upcoming Realtime books will cover only the chat model. This book ships dedicated chapters for translation and streaming transcription with their endpoint-specific behavior.
- Three transports, with their boundaries. WebSocket, SIP, and browser WebRTC each get a chapter with the architectural rules called out. WebRTC is JavaScript-only by design; the Python role is ephemeral keys and sideband.
- Production discipline. Latency budgets, cost models, reconnection, and CI-failing evals ship as code, not as appendix advice.

Prerequisites: Intermediate Python (3.11+). Familiarity with asyncio and basic audio concepts. An OpenAI API key with Realtime access. No prior voice-agent experience required.

Three GA models. Three transports. Production Python in 12 chapters. Build for the stream.

| Attribute | Value |
| --- | --- |
| Gtin | 09798903770441 |
| Age_group | ADULT |
| Condition | NEW |
| Gender | UNISEX |
| Product_category | Gl_book |
| Google_product_category | Media > Books |
| Product_type | Books > Subjects > Computers & Technology > Computer Science > AI & Machine Learning > Natural Language Processing |