At Google I/O 2025, Google announced an updated version of its Gemini 2.5 multimodal model that now supports real-time audio dialogue and controllable speech generation. These capabilities are available in preview for developers through the Google AI Studio and Vertex AI platforms.
Image source: Google
Gemini 2.5 Flash Preview delivers natural-sounding AI voice interactions, including recognition of a user's emotional tone, adaptation of intonation and accent, and the ability to switch between more than 24 languages. The model can filter out background noise and call external tools such as Google Search to retrieve relevant information during a conversation.
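For developers, the real-time dialogue capability is exposed through the Live API in Google's Gen AI SDK. The following Python sketch is purely illustrative: it assumes the google-genai package, an API key in the environment, and a preview model identifier (shown here as a placeholder); the exact model id and method signatures should be checked against the current documentation.

```python
# Minimal sketch of a real-time audio session with the Gemini Live API.
# Assumptions: google-genai SDK installed, API key set in the environment,
# and the preview model id below is a placeholder that may differ.
import asyncio
from google import genai

client = genai.Client()  # reads the API key from the environment

MODEL = "gemini-2.5-flash-preview-native-audio-dialog"  # assumed preview model id
CONFIG = {"response_modalities": ["AUDIO"]}             # request spoken replies

async def main() -> None:
    # Open a bidirectional streaming session with the model.
    async with client.aio.live.connect(model=MODEL, config=CONFIG) as session:
        # Send one text turn; a real app would stream microphone audio instead.
        await session.send_client_content(
            turns={"role": "user", "parts": [{"text": "Summarize today's agenda in one sentence."}]},
            turn_complete=True,
        )
        # Collect the streamed audio chunks of the model's spoken answer.
        audio = bytearray()
        async for message in session.receive():
            if message.data is not None:
                audio.extend(message.data)
        # Live audio output is raw PCM (typically 24 kHz, 16-bit mono).
        with open("reply.pcm", "wb") as f:
            f.write(audio)

asyncio.run(main())
```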
Additionally, Gemini 2.5 offers advanced text-to-speech (TTS) capabilities that let developers control the style, pace, and emotional expressiveness of generated speech. It also supports multi-speaker dialogue generation, making it suitable for producing podcasts, audiobooks, and other multimedia content.
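As an illustration of the multi-speaker TTS feature, here is a minimal Python sketch using the google-genai SDK. The model id and the prebuilt voice names are assumptions used as placeholders and should be verified against the current documentation.

```python
# Minimal sketch of multi-speaker TTS with the Gemini API (google-genai SDK).
# Assumptions: the model id and voice names below are placeholders to verify.
import wave
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

prompt = """TTS the following conversation between Ana and Lev in a warm, upbeat tone:
Ana: Welcome back to the show!
Lev: Thanks, happy to be here."""

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",  # assumed preview TTS model id
    contents=prompt,
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
                speaker_voice_configs=[
                    types.SpeakerVoiceConfig(
                        speaker="Ana",
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
                        ),
                    ),
                    types.SpeakerVoiceConfig(
                        speaker="Lev",
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Puck")
                        ),
                    ),
                ]
            )
        ),
    ),
)

# The audio is returned as raw PCM (typically 24 kHz, 16-bit mono);
# wrap it in a WAV container so standard players can open it.
pcm = response.candidates[0].content.parts[0].inline_data.data
with wave.open("dialogue.wav", "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(24000)
    wav.writeframes(pcm)
```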
To ensure transparency, all audio generated by the model is watermarked with SynthID, allowing the content to be identified as AI-generated. Developers can try the new features via the Stream and Generate Media tabs in Google AI Studio.
Gemini 2.5 represents a significant step forward for multimodal AI systems, integrating text, image, audio, and video modalities into a single platform. The new capabilities open the door to interactive applications, virtual assistants, and educational tools.