
AI Assistants Shine in Speaking, Struggle with Audio Understanding

AI assistants may talk the talk, but can they walk the walk? A new study reveals their limitations in understanding and processing audio inputs.



Researchers have evaluated twenty-one open-source models alongside GPT-4o-Audio using a new benchmark called VoiceAssistant-Eval. The study found that most of these systems excel at speaking tasks but often struggle to interpret audio accurately and to process combined audio and visual inputs.

VoiceAssistant-Eval is a comprehensive benchmark that assesses AI assistants across listening, speaking, and viewing tasks. It comprises over ten thousand carefully curated examples covering thirteen distinct task categories. Notably, the mid-sized Step-Audio-2-mini (7B) achieved listening accuracy more than double that of the larger LLaMA-Omni2-32B-Bilingual. This finding suggests that well-designed smaller models can rival much larger ones.

The study also found that proprietary models do not consistently outperform open-source alternatives, which challenges the notion that proprietary models always offer superior performance. Moreover, visual interpretation errors account for half of all mistakes on viewing tasks, indicating a significant area for improvement.

The increasing sophistication of AI assistants demands robust evaluation methods, and VoiceAssistant-Eval, with its diverse task categories and careful curation, aims to fill this gap. Despite their speaking prowess, AI assistants still face challenges in understanding audio and processing combined inputs. Further research and development are needed to enhance these capabilities.
