OpenAI Introduces GPT-Realtime Speech Generation Model, Makes Realtime API Generally Available

OpenAI’s GPT-Realtime is reportedly the company’s most advanced voice model, designed for customer support and assistance.

Written by Akash Dutta, Edited by Ketan Pratap | Updated: 29 August 2025 13:21 IST
Photo Credit: OpenAI

OpenAI’s new speech generation model can also analyse and read text in images

Highlights
  • OpenAI said the model was trained in collaboration with companies
  • GPT-Realtime will be available with new Cedar and Marin voices
  • The Realtime API was first released as a public beta in October 2024
OpenAI on Thursday announced a new artificial intelligence (AI) speech generation model dubbed GPT-Realtime. The enterprise-focused model generates native audio with low latency, enabling two-way, real-time voice conversations. The San Francisco-based AI firm said that, compared to its existing voice models, GPT-Realtime offers higher-quality output and lower processing times. It also adds features such as tool calling, support for remote Model Context Protocol (MCP) servers, image input, and improved detection of alphanumeric sequences in select non-English languages.

OpenAI Brings New Speech Model for Enterprises

In a post, the AI firm announced the release of its most advanced speech generation model, GPT-Realtime. A speech generation model differs from the traditional voice assistants that companies use for customer support. Those chain together multiple systems, such as speech-to-text and text-to-speech, to carry out a voice conversation with a human. In comparison, the OpenAI model natively processes speech input and generates the corresponding speech output, resulting in significantly lower response times.

GPT-Realtime features several new and enhanced capabilities. Similar to Advanced Voice Mode, it is capable of generating a highly expressive and natural-sounding voice, which developers can adjust with text-based instructions. Two new voices are being introduced, Cedar (male) and Marin (female), and the company is also updating its eight existing voices.

In terms of performance, the model can capture non-verbal cues, such as laughter, and respond to them. It can also switch languages mid-sentence and adapt to the user's tone. Based on internal evaluations, OpenAI claims that the model displays higher performance in detecting alphanumeric sequences (such as phone and policy numbers) in non-English languages, such as Chinese, French, Japanese, and Spanish.

The company claimed that GPT-Realtime scored 82.8 percent on the Big Bench Audio benchmark, which measures a voice model's accuracy and reasoning ability. This is significantly higher than its predecessor from December 2024, which scored 65.6 percent.
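To put the two benchmark figures in perspective, the gap can be expressed both in percentage points and as a relative improvement. The short Python sketch below works this out from the scores quoted above; the function name `compare_scores` is only illustrative and is not part of any OpenAI tooling.

```python
def compare_scores(new_pct: float, old_pct: float) -> tuple[float, float]:
    """Return (percentage-point gain, relative gain in percent), rounded to 1 decimal."""
    if old_pct <= 0:
        raise ValueError("old_pct must be positive")
    point_gain = new_pct - old_pct
    relative_gain = point_gain / old_pct * 100
    return round(point_gain, 1), round(relative_gain, 1)


# Scores as claimed by OpenAI for Big Bench Audio.
point_gain, relative_gain = compare_scores(82.8, 65.6)
print(f"{point_gain} percentage points, {relative_gain}% relative improvement")
```

On the claimed numbers, this comes to a gain of 17.2 percentage points, or roughly a 26 percent relative improvement over the December 2024 model.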

Additionally, OpenAI claimed that the speech generation model has higher instruction adherence, supports function and tool calling, and can be configured to support remote MCP servers. It can also analyse and read images, allowing use cases where users can upload an image for better context, and the model can then incorporate it into the conversation.

Notably, GPT-Realtime is an enterprise-focused offering, and it is exclusively available with the company's Realtime API, which is now generally available to all developers. The API was first introduced in October 2024 as a public beta.

On pricing, GPT-Realtime costs developers $32 (roughly Rs. 2,800) per million input tokens and $64 (roughly Rs. 5,600) per million output tokens. Cached input tokens are priced at $0.40 (roughly Rs. 35) per million.
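For developers budgeting a deployment, the per-million-token rates translate into a simple cost estimate. The following Python sketch uses only the three prices quoted above; the function name `estimate_cost_usd` and the token counts in the example are hypothetical, and actual billing may depend on details not covered in this article.

```python
# Prices in USD per one million tokens, as quoted in the article.
PRICE_PER_MILLION_USD = {
    "input": 32.00,
    "cached_input": 0.40,
    "output": 64.00,
}


def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      cached_input_tokens: int = 0) -> float:
    """Estimate the cost of a workload from its token counts (illustrative only)."""
    if min(input_tokens, output_tokens, cached_input_tokens) < 0:
        raise ValueError("token counts must be non-negative")
    total = (
        input_tokens * PRICE_PER_MILLION_USD["input"]
        + cached_input_tokens * PRICE_PER_MILLION_USD["cached_input"]
        + output_tokens * PRICE_PER_MILLION_USD["output"]
    ) / 1_000_000
    return round(total, 6)


# Hypothetical workload: 50,000 input, 20,000 output, 10,000 cached input tokens.
cost = estimate_cost_usd(50_000, 20_000, cached_input_tokens=10_000)
print(f"Estimated cost: ${cost:.3f}")
```

For this made-up workload the estimate is $2.884, of which the output tokens account for $1.28 and the cached input tokens for less than a cent.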

