qwertox 4 hours ago

> The Realtime API improves this by streaming audio inputs and outputs directly, enabling more natural conversational experiences. It can also handle interruptions automatically, much like Advanced Voice Mode in ChatGPT.

> Under the hood, the Realtime API lets you create a persistent WebSocket connection to exchange messages with GPT-4o. The API supports function calling, which makes it possible for voice assistants to respond to user requests by triggering actions or pulling in new context.

-

This sounds really interesting, and I see great use cases for it. However, I'm wondering whether the API provides a text transcription of both the input and the output, so that I can store the data directly in a database without needing to transcribe the audio separately.

-

Edit: Apparently it does.

It sends `conversation.item.input_audio_transcription.completed` [0] events when the input transcription is done (I'd guess several of them, arriving in near real-time)

and `response.done` [1] events with the response text.

[0] https://platform.openai.com/docs/api-reference/realtime-serv...

[1] https://platform.openai.com/docs/api-reference/realtime-serv...
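
A minimal sketch of storing both sides, assuming the endpoint, headers, and event shapes from the launch docs (`save_to_db` is a hypothetical stand-in for real storage):

```python
import asyncio
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

def save_to_db(role: str, text: str) -> None:
    print(role, text)  # hypothetical stand-in for a real INSERT

async def log_transcripts() -> None:
    # note: newer websockets versions name this kwarg `additional_headers`
    async with websockets.connect(URL, extra_headers=HEADERS) as ws:
        async for raw in ws:
            event = json.loads(raw)
            if event["type"] == "conversation.item.input_audio_transcription.completed":
                save_to_db("user", event["transcript"])
            elif event["type"] == "response.done":
                # the response text rides along with the audio output items
                for item in event["response"].get("output", []):
                    for part in item.get("content", []):
                        if "transcript" in part:
                            save_to_db("assistant", part["transcript"])

asyncio.run(log_transcripts())
```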

  • bcherry 2 hours ago

    yes, it transcribes inputs automatically, but not in realtime.

    outputs are sent as text + audio, but you'll get the text very quickly and the audio a bit slower, and of course the audio takes time to play back. the text also doesn't currently have timing cues, so it's up to you if you want to try to play it "in sync". if the user interrupts the audio, you need to send back a truncation event so the server can roll its context back, and if you never presented the text to the user you'll need to truncate it on your end as well, to ensure your storage isn't polluted with fragments the user never heard.
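
    A hypothetical sketch of that truncation flow, with the event and field names assumed from the launch docs (`ws` being the open Realtime WebSocket):

    ```python
    import json

    def truncate_interrupted_audio(ws, item_id: str, played_ms: int) -> None:
        """Tell the server to drop audio the user never heard.

        Event and field names are assumptions based on the launch docs.
        """
        ws.send(json.dumps({
            "type": "conversation.item.truncate",
            "item_id": item_id,         # the assistant message being played back
            "content_index": 0,
            "audio_end_ms": played_ms,  # playback position when interrupted
        }))
    ```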

  • tough 3 hours ago

    saw Velvet's Show HN the other day, could be useful for storing these https://news.ycombinator.com/item?id=41637550

    • BoorishBears an hour ago

      OpenAI just launched the equivalent of Velvet as a full fledged feature today.

      But separate from that, you typically want some application-specific storage of the current "conversation" in a very different format than raw request logging.

101008 3 hours ago

I understand the Realtime API voice novelty, and the technological achievement it is, but I don't see it from the product point of view. It looks like one of those startups finding a solution before knowing the problem.

The two examples shown at DevDay are exactly the things I don't want to do in the future. I don't want to talk to anybody, and I don't want to wait for their answer in a human form. That's why I order my food through an app or WhatsApp, and why I prefer to buy my tickets online. In the rare case that I call to order food, it's because I have a weird question or a weird request (can I pick it up in X minutes? Can you prepare it in a different way?)

I hope we don't start seeing apps using conversations as interfaces, because it would be really horrible (leaving aside the fact that a lot of people don't know how to express themselves, plus different accents, noisy environments, etc.), while clicking or typing works almost the same for everyone (at least it's much more normalized than talking).

  • com2kid 2 hours ago

    > I understand the Realtime API voice novelty, and the technological achievement it is, but I don't see it from the product point of view. It looks like one of those startups finding a solution before knowing the problem.

    The market for realistic voice agents is huge, but also very fragmented. Customer service is the obvious example: large companies employ tens of thousands of customer service phone agents, and a large number of those calls can be handled, at least in part, by a sufficiently smart voice agent.

    Sales is another: just calling back leads and checking in on them. Voice-clone the original sales agent, give the AI enough context about previous interactions, and a lot of boring legwork can be handled by AI.

    Answering simple questions is another great example. Restaurants get slammed with calls during their busiest hours (seriously, getting ahold of restaurant staff during peak hours can be literally impossible!), so having an AI that can pick up the phone and answer basic questions (what's in certain dishes, what the current wait time is, what the largest group that can be seated together is, etc.) is super useful.

    A lot of small businesses with only a single employee can benefit from having a voice AI assistant picking up the phone and answering the easy everyday queries and then handing everything else off to the owner.

    The key is that these voice AIs should be seamless: you ask your question, they answer, and ideally you don't even know it's an AI.

    • axus an hour ago

      And after you're misled by a sales agent, it doesn't make you as angry, because it's just an AI.

      • 93po an hour ago

        they're definitely going to instruct the AI agents to lie to you, and deliberately waste your time, and be pushier than ever, because unlike with a real human it costs them nothing to keep you on the line even longer. at least we'll have our own agents to waste their compute in turn

        • com2kid 44 minutes ago

          Any company that is that scummy already has salespeople working for it who are that scummy and lie non-stop.

          The AI isn't changing that equation at all.

          • JamesBarney 25 minutes ago

            AI is actually better here.

            1. AI instructions are legible. There is no record of someone asking John to sell the customer things they don't need; there is a record if the AI is instructed to do it.

            2. AI interactions are legible. If a sales guy tells you something false on a Zoom call, there is no record of it. If the AI does, there is a record.

  • bcherry an hour ago

    keep in mind that this is just v1 of the realtime api. they'll add realtime vision/video down the road which can also have wide applications beyond synchronous communication.

  • ilaksh 2 hours ago

    You're right, having a voice conversation for any reason is just so passe these days. They should stop adding microphones to phones and everything. So old-fashioned and inefficient. And who wants to ever have to actually talk to someone or some AI to ask for anything? I'm sure our vocal cords will evolve away soon. They are so primitive. Vestigial organs.

alach11 an hour ago

It's pretty amazing that they made prompt caching automatic. It's rare that a company gives a 50% discount without the customer explicitly requesting it! Of course... they might be retaining some margin, judging by their discount being 50% vs. Anthropic's 90%.

siva7 3 hours ago

I've never seen a company consistently publish groundbreaking features at such speed as this one. I really wonder how their teams work. It's unprecedented in what I've seen in 15 years of software.

  • IdiocyInAction 3 hours ago

    AFAIK a lot of these ideas are not new (the JSON thing was done with open-source models before), and OpenAI is possibly the hottest startup with the most funding this decade (maybe even the past two decades?), so I think this is actually all within expectations.

    • throwup238 28 minutes ago

      > OpenAI is possibly the hottest startup with the most funding this decade (maybe even past two decades?)

      It depends on how you define startup, but I don't think they will surpass Uber, ByteDance, or SpaceX until this next rumored funding round.

      I'm excluding companies that have raised funding post-IPO, since that's an obvious cutoff for startups. The other cutoff would be break-even, in which case Uber has raised well over $20 billion.

    • sk11001 2 hours ago

      They're exceptional at executing and delivering; you don't get that just through having more funding.

      • testfrequency an hour ago

        It's literally just a bunch of ex-Stripe employees and data scientists...

      • jiggawatts 2 hours ago

        How are they exceptional?

        Their web UI was a glitchy mess for over a year. Rollouts are staggered and often delayed. They still can't adhere to a JSON schema accurately, even though others figured this out ages ago. There are regular global outages. Etc.

        I’m impressed by some aspects of their rapid growth, but these are financial achievements (credit due Sam) more than technical ones.

        • closewith an hour ago

          I have a few qualms with this app:

          1. For a Linux user, you can already build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem. From Windows or Mac, this FTP account could be accessed through built-in software.

          2. It doesn't actually replace a USB drive. Most people I know e-mail files to themselves or host them somewhere online to be able to perform presentations, but they still carry a USB drive in case there are connectivity problems. This does not solve the connectivity issue.

          3. It does not seem very "viral" or income-generating. I know this is premature at this point, but without charging users for the service, is it reasonable to expect to make money off of this?

        • hobofan an hour ago

          Not sure why you are being downvoted; you are generally right. Most of their new product rollouts were accompanied by huge production instabilities for paying customers. Only with the most recent ones did they manage that better.

          > They still can’t adhere to a JSON schema accurately

          Strict mode for structured output fixes at least this though.
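
          A minimal sketch of that strict mode, assuming the documented `response_format` shape in the Chat Completions API:

          ```python
          from openai import OpenAI  # pip install openai

          client = OpenAI()
          resp = client.chat.completions.create(
              model="gpt-4o-2024-08-06",
              messages=[{"role": "user", "content": "Extract the city: I live in Oslo."}],
              response_format={
                  "type": "json_schema",
                  "json_schema": {
                      "name": "city_extraction",
                      "strict": True,  # constrained decoding: output must match the schema
                      "schema": {
                          "type": "object",
                          "properties": {"city": {"type": "string"}},
                          "required": ["city"],
                          "additionalProperties": False,
                      },
                  },
              },
          )
          print(resp.choices[0].message.content)  # e.g. {"city":"Oslo"}
          ```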

  • pheeney 3 hours ago

    I wonder how much they use their own products internally to speed up development and decisions.

    • abound 2 hours ago

      They definitely use their own products internally, perhaps to a fault: While chatting with OpenAI recruiters, I received calendar events with nonsensical DALLE-generated calendar images, and "interview prep" guides that were clearly written by an older GPT model.

    • amlib 3 hours ago

      And I wonder how much they use them externally to influence the online conversations about their own products/company.

  • roboboffin 3 hours ago

    Is it that most models are based on the transformer architecture, and so performance improvements can then be used throughout their different products?

  • nextworddev 37 minutes ago

    GPT 5 is writing their code

ponty_rick 4 hours ago

> 11:43 Fields are generated in the same order that you defined them in the schema, even though JSON is supposed to ignore key order. This ensures you can implement things like chain-of-thought by adding those keys in the correct order in your schema design.

Why not use an array of key value pairs if you want to maintain ordering without breaking traditional JSON rules?

[ {"key1": "value1"}, {"key2": "value2"} ]
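
For reference, the quoted trick as a hypothetical schema, with a "reasoning" field deliberately declared before "answer":

```python
# Hypothetical chain-of-thought schema: because fields are generated in
# schema order, "reasoning" is produced before the model commits to "answer".
cot_schema = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string"},  # generated first
        "answer": {"type": "string"},     # conditioned on the reasoning above
    },
    "required": ["reasoning", "answer"],
    "additionalProperties": False,
}
```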

  • benatkin 2 hours ago

    > even though JSON is supposed to ignore key order

    Most tools preserve the order; I consider it an unofficial feature of JSON at this point. A lot of people think of it as a soft guarantee, but it's a hard guarantee in all recent JavaScript and Python versions. There are some common places where it's lost, like JSONB in Postgres, but it's good to be aware that this unofficial feature is commonly relied upon.
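
    A quick illustration in Python, where dict insertion order has been guaranteed since 3.7 and `json.loads` preserves document order:

    ```python
    import json

    # Keys come back in document order, not sorted:
    obj = json.loads('{"b": 1, "a": 2}')
    print(list(obj))        # ['b', 'a']
    print(json.dumps(obj))  # {"b": 1, "a": 2}
    ```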

serjester 4 hours ago

The eval platform is a game changer.

It's nice to have a solution from OpenAI, given how much they use a variant of this internally. I've tried like 5 YC startups and I don't think anyone's really solved this.

There's a very real risk of vendor lock-in, but from quickly scanning the docs it seems like a pretty portable implementation.

N_A_T_E 2 hours ago

I just need their API to be faster. 15-30 seconds per request using 4o-mini isn't good enough for responsive applications.

  • carlgreene 2 hours ago

    That is odd. Longest I’ve experienced in my use of it is a few seconds.

  • BoorishBears an hour ago

    You should try Azure: it comes with dedicated capacity which is typically a very expensive "call our sales team" feature with OpenAI

superdisk 4 hours ago

Holy crud, I figured they would guard this for a long time, and I was really salivating to make some stuff with it. The doors are wide open for all sorts of stuff now. Advanced Voice is the first feature since ChatGPT initially came out that really has my jaw on the floor.

  • jacooper 4 hours ago

    Try NotebookLM, it's the ChatGPT moment for Google's DeepMind

    • world2vec 2 hours ago

      I wish I could, but it's not available in the UK, IIRC.

modeless 2 hours ago

I didn't expect an API for Advanced Voice so soon. That's pretty great. Here's the thing I was really wondering about: audio is $0.06/min in, $0.24/min out. Can't wait to try some language-learning apps built with this. It'll also be fun for controlling robots.

thenameless7741 4 hours ago

Blog updates:

- Introducing the Realtime API: https://openai.com/index/introducing-the-realtime-api/

- Introducing vision to the fine-tuning API: https://openai.com/index/introducing-vision-to-the-fine-tuni...

- Prompt Caching in the API: https://openai.com/index/api-prompt-caching/

- Model Distillation in the API: https://openai.com/index/api-model-distillation/

Docs updates:

- Realtime API: https://platform.openai.com/docs/guides/realtime

- Vision fine-tuning: https://platform.openai.com/docs/guides/fine-tuning/vision

- Prompt Caching: https://platform.openai.com/docs/guides/prompt-caching

- Model Distillation: https://platform.openai.com/docs/guides/distillation

- Evaluating model performance: https://platform.openai.com/docs/guides/evals

Additional updates from @OpenAIDevs: https://x.com/OpenAIDevs/status/1841175537060102396

- New prompt generator on https://playground.openai.com

- Access to the o1 model is expanded to developers on usage tier 3, and rate limits are increased (to the same limits as GPT-4o)

Additional updates from @OpenAI: https://x.com/OpenAI/status/1841179938642411582

- Advanced Voice is rolling out globally to ChatGPT Enterprise, Edu, and Team users. Free users will get a sneak peek of it (except in the EU).

  • visarga 4 hours ago

    > Advanced Voice is rolling out globally to ChatGPT Enterprise, Edu, and Team users. Free users will get a sneak peek of it.

    So regular paying users from the EU are still left out in the cold.

    • Version467 3 hours ago

      Yes, but it works with a vpn and the change in latency isn’t big enough to have a noticeable impact on usability.

    • AlanYx 4 hours ago

      It's probably stuck in legal limbo in the EU. The recently passed EU AI Act prohibits "AI systems aiming to identify or infer emotions", and Advanced Voice does definitely infer the user's emotions.

      (There is an exemption for "AI systems placed on the market strictly for medical or safety reasons, such as systems intended for therapeutical use", but Advanced Voice probably doesn't benefit from that exemption.)

      • qwertox 4 hours ago

        Apparently this prohibition only applies to "situations related to the workplace and education", and, in this context, "That prohibition should not cover AI systems placed on the market strictly for medical or safety reasons"

        So it seems to be possible to use this in a personal context.

        https://artificialintelligenceact.eu/recital/44/

        > Therefore, the placing on the market, the putting into service, or the use of AI systems intended to be used to detect the emotional state of individuals in situations related to the workplace and education should be prohibited. That prohibition should not cover AI systems placed on the market strictly for medical or safety reasons, such as systems intended for therapeutical use.

        • AlanYx 3 hours ago

          This is true, though it may not make sense commercially for them to offer an API that can't be used for workplace (business) applications or education.

          • qwertox 3 hours ago

            I see what you mean, but I think that "workplace" specifically refers to the context of the workplace, so that an employer cannot use AI to monitor employees, even if they have been pressured to agree to such monitoring. I think this is unrelated to "commercially offering services which can detect emotions".

            But then I don't get the spirit of that limitation, as it should be just as applicable to TVs listening in on your conversations and trying to infer your emotions. Then again, I guess that for these cases there are other rules in place which prohibit doing this without the explicit consent of the user.

            • runako 3 hours ago

              > I think that

              > I think this

              > I don't get the spirit of that limitation

              > I guess that

              In a nutshell, this uncertainty is why firms are going to slow-roll the EU rollout of AI and, for designated gatekeepers, of other features. Until there is a body of litigated cases to use as reference, companies would be placing themselves on the hook for tremendous fines, not to mention the distraction of their executives.

              Which, not making any value judgement here, is the point of these laws: to slow down innovation so that society, government, and regulation can digest new technologies. This is the intended effect, and the laws are working.

minimaxir 4 hours ago

From the Realtime API blog post: https://openai.com/index/introducing-the-realtime-api/

> Audio in the Chat Completions API will be released in the coming weeks, as a new model `gpt-4o-audio-preview`. With `gpt-4o-audio-preview`, developers can input text or audio into GPT-4o and receive responses in text, audio, or both.

> The Realtime API uses both text tokens and audio tokens. Text input tokens are priced at $5 per 1M and $20 per 1M output tokens. Audio input is priced at $100 per 1M tokens and output is $200 per 1M tokens. This equates to approximately $0.06 per minute of audio input and $0.24 per minute of audio output. Audio in the Chat Completions API will be the same price.

As usual, OpenAI failed to emphasize the real game-changer feature at their DevDay: audio output from the standard generation API.

This has severe implications for text-to-speech apps, particularly if the audio output style is as steerable as the gpt-4o voice demos.
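
Back-of-envelope from the quoted prices, to see what the per-minute figures imply about audio token rates:

```python
# Quoted prices: $100/1M audio input tokens, $200/1M audio output tokens,
# equated to ~$0.06/min in and ~$0.24/min out.
input_cost_per_token = 100 / 1_000_000    # $0.0001 per audio input token
output_cost_per_token = 200 / 1_000_000   # $0.0002 per audio output token

print(0.06 / input_cost_per_token)    # ~600 audio tokens per minute of input (~10/s)
print(0.24 / output_cost_per_token)   # ~1200 audio tokens per minute of output (~20/s)
```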

  • OutOfHere 3 hours ago

    > and $0.24 per minute of audio output

    That is substantially more expensive than TTS (text-to-speech), which is already quite expensive.

    • minimaxir 3 hours ago

      Fair, it wouldn't work well for on-demand generation in an app, but for ad-hoc cases like a voice-over it's not a huge expense.

      If OpenAI decides to fully ignore ethics and dive deep into voice cloning, then all bets are off.

og_kalu 3 hours ago

Image output for 4o in the API would be very nice, but I'm not sure if that's at all in the cards.

Audio output is in the API now, but you lose image input. Why? That's a shame.

nielsole 4 hours ago

> The first big announcement: a realtime API, providing the ability to use WebSockets to implement voice input and output against their models.

I guess this is using their "old" turn-based voice system?

lysecret 3 hours ago

Using structured outputs for generative UI is such a cool idea. Does anyone know of some cool web demos related to this?

  • jiggawatts 41 minutes ago

    I just had an evil thought: once AIs are fast enough, it would be possible to create a "dynamic" user interface on the fly using an AI. Instead of Java or C# code running in an event loop processing mouse clicks, in principle we could have a chat bot generate the UI elements in a markup language like WPF's XAML or plain HTML, and process user mouse and keyboard input events!

    If you squint at it, this is what chat bots do now, except with a “terminal” style text UI instead of a GUI or true Web UI.

    The first incremental step has already been taken: pretty-printing of maths and code. Interactive components are a logical next step.

    It would be a mere afternoon of work to write a web server where the dozens of "controllers" are replaced with a single call to an LLM API that simply sends the previous page's HTML and the incoming HTTP request, headers and all.

    “Based on the previous HTML above and the HTTP request below, output the response HTML.”

    Just sprinkle on some function calling and a database schema, and the site is done!
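
    A toy sketch of that server, assuming the standard `flask` and `openai` Python packages (hypothetical and wildly impractical; state, auth, and cost are all hand-waved):

    ```python
    # Toy sketch: all "controllers" are replaced by one LLM call per request.
    from flask import Flask, request
    from openai import OpenAI  # pip install flask openai

    app = Flask(__name__)
    client = OpenAI()
    last_html = "<html><body><h1>Home</h1><a href='/about'>About</a></body></html>"

    @app.route("/", defaults={"path": ""}, methods=["GET", "POST"])
    @app.route("/<path:path>", methods=["GET", "POST"])
    def serve(path: str):
        global last_html
        prompt = (
            f"Previous HTML:\n{last_html}\n\n"
            f"HTTP request: {request.method} /{path}\n"
            f"Form data: {dict(request.form)}\n\n"
            "Based on the previous HTML and the HTTP request above, "
            "output only the response HTML."
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        last_html = resp.choices[0].message.content or last_html
        return last_html
    ```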

    • ghthor 6 minutes ago

      That actually sounds pretty entertaining, especially if there is dynamic user input, like a text box.

sammyteee 3 hours ago

Loving these live updates, keep em coming! Thanks Simon!

hidelooktropic 4 hours ago

Any word on increased weekly caps on o1 usage?

  • zamadatix an hour ago

    Weekly caps are for standard accounts (not going to be talked about at DevDay). The blog does note RPM changes for the API though:

    "10:30 They started with some demos of o1 being used in applications, and announced that the rate limit for o1 doubled to 10000 RPM (from 5000 RPM) - same as GPT-4 now."