OpenAI has released GPT-4o (the 'o' stands for omni), an enhanced version of GPT-4 that can accept any combination of text, audio, or images as input and respond in all three formats as well.
On the GPT-4o announcement page, a video demonstrates a conversation between an OpenAI employee and GPT-4o in which ChatGPT responds in a conversational tone that sounds strikingly similar to that of a human — warm, realistic, and expressive. This puts its voice far ahead of its competitors Siri and Google Assistant, which sound robotic in comparison.
GPT-4o was also asked to guess where the employee was based on his live camera feed, and it accurately described that he was in a recording/production studio. It also noticed details of his surroundings and asked questions about them before he even mentioned them. The improvements are largely in the audio and visual areas, which is what now puts it in direct competition with Siri and Google Assistant.
OpenAI claims that GPT-4o responds in real time (which it did in the demonstration video). The company says it replies to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds. By contrast, it used to take 5.4 seconds on average to get a response from GPT-4 when a question was sent as a voice recording. The company also touts that GPT-4o is 50% cheaper than GPT-4 for developers using the API to build their own chatbots.
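As a rough illustration of that API route, the sketch below sends a text prompt together with an image URL to the gpt-4o model using OpenAI's Python client. The prompt and the image URL are placeholders, and the snippet assumes the openai package is installed and an OPENAI_API_KEY environment variable is set.

```python
# Minimal sketch: calling GPT-4o through the OpenAI API with text + image input.
# Assumes `pip install openai` and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                # Text part of the prompt
                {"type": "text", "text": "What kind of room is shown in this picture?"},
                # Image part: a placeholder URL standing in for a camera frame
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/studio-frame.jpg"},
                },
            ],
        }
    ],
)

# The reply comes back as ordinary text in the first choice
print(response.choices[0].message.content)
```

This only covers the text-and-vision side of the model; audio input and output in the API are handled through separate interfaces rather than this text endpoint.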