Microsoft has expanded its Phi-4 family of large language AI models with two new releases that have relatively modest system requirements. One of them is multimodal, meaning it works with several data formats.
Image source: microsoft.com
Microsoft’s Phi-4-mini is a text-only model, while Phi-4-multimodal is an extended version that can also handle visual and audio queries. Both models, the developer claims, significantly outperform similarly sized alternatives in certain tasks.
Microsoft Phi-4-mini has 3.8 billion parameters, making it compact enough to run on mobile devices. The model is based on a variant of the Transformer architecture. Bidirectional transformer models analyze the text both before and after each word to understand its meaning; for Phi-4-mini, Microsoft instead used a decoder-only Transformer, which considers only the text preceding each word. This reduces the load on computing resources and speeds up processing.
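To illustrate the difference, here is a minimal sketch of causal (decoder-only) attention in PyTorch, where each position can attend only to itself and to earlier tokens. This is a generic illustration, not Microsoft's actual Phi-4-mini code; the tensor shapes are arbitrary.

```python
import torch

# Toy illustration of causal (decoder-only) attention: each position may
# attend only to itself and earlier positions, never to later ones.
# Generic sketch, not Microsoft's Phi-4-mini implementation.

def causal_attention(q, k, v):
    # q, k, v: (seq_len, head_dim)
    seq_len = q.size(0)
    scores = q @ k.T / k.size(-1) ** 0.5               # (seq_len, seq_len)
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool),
                      diagonal=1)                       # True above the diagonal
    scores = scores.masked_fill(mask, float("-inf"))    # hide future tokens
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(4, 8)
out = causal_attention(q, k, v)  # row i depends only on positions 0..i
```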
For additional optimization, the model uses Grouped Query Attention, a technique in which several query heads share the same key and value projections, helping the model pick out the pieces of data most relevant to the current task at a lower memory cost. Phi-4-mini can generate text, translate documents, and control external applications; according to its developers, the model excels at solving math problems and writing computer code, even when “complex reasoning” is required. Microsoft’s own estimates put the accuracy of Phi-4-mini’s answers “significantly” above the results of several other similarly sized models.
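As an illustration of the idea behind Grouped Query Attention, the hedged sketch below shows several query heads sharing a smaller set of key/value heads. The head counts here are arbitrary, not Phi-4-mini's actual configuration.

```python
import torch

# Minimal sketch of Grouped Query Attention (GQA): groups of query heads
# reuse a single key/value head, shrinking the KV cache that must be kept
# in memory during generation. Illustrative head counts only.

def gqa(q, k, v, n_q_heads=8, n_kv_heads=2):
    # q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d)
    group = n_q_heads // n_kv_heads
    # Repeat each KV head so every group of query heads can attend to it.
    k = k.repeat_interleave(group, dim=0)   # -> (n_q_heads, seq, d)
    v = v.repeat_interleave(group, dim=0)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    return torch.softmax(scores, dim=-1) @ v

seq, d = 4, 16
q = torch.randn(8, seq, d)
k = torch.randn(2, seq, d)
v = torch.randn(2, seq, d)
out = gqa(q, k, v)   # (8, seq, d): 8 query heads served by only 2 KV heads
```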
Phi-4-multimodal is an extended version of Phi-4-mini with 5.6 billion parameters; it accepts not only text but also images, audio, and video as queries. To further train the model, Microsoft used a new method called Mixture of LoRAs. Adapting an AI model to a new task usually requires changing its weights, the configuration parameters that determine how it processes data. The LoRA (Low-Rank Adaptation) method makes this easier: a small number of new weights, optimized for the unfamiliar task, are added to the model. Mixture of LoRAs adapts this mechanism to multimodal processing: to build Phi-4-multimodal, the original Phi-4-mini was supplemented with weights optimized for handling audio and video. As a result, Microsoft says, it was possible to soften some of the trade-offs associated with other approaches to building multimodal models.
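The sketch below illustrates the LoRA idea in PyTorch: the base weight stays frozen while a small low-rank update is trained, and a dictionary of per-modality adapters stands in for the Mixture-of-LoRAs routing. This is an illustrative approximation, not Microsoft's implementation; the adapter names and rank are assumptions.

```python
import torch
import torch.nn as nn

# LoRA sketch: the base weight W is frozen, and a small low-rank update
# B @ A is trained instead. The per-modality adapter dict below only
# approximates the "Mixture of LoRAs" idea; names and rank are assumed.

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # original weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as no-op

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T   # W x + B A x

base = nn.Linear(64, 64)                                 # shared frozen backbone
adapters = {m: LoRALinear(base) for m in ("vision", "audio")}  # one LoRA per modality
x = torch.randn(2, 64)
y = adapters["vision"](x)    # route through the modality-specific adapter
```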
In visual processing tests, Phi-4-multimodal scored 72 points, slightly behind leading models from OpenAI and Google. In simultaneous video and audio processing, it “far outperformed” Google’s Gemini-2.0 Flash and the open-source InternOmni. Phi-4-mini and Phi-4-multimodal are available on the Hugging Face platform under an MIT license, which allows commercial use.
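For readers who want to try the models, a minimal loading sketch with the Hugging Face transformers library might look like the following. The model identifier used here is an assumption on our part; check the Hugging Face hub for the exact published name before running.

```python
# Hedged sketch: loading Phi-4-mini with the `transformers` library.
# The model ID "microsoft/Phi-4-mini-instruct" is an assumed checkpoint
# name; verify it on the Hugging Face hub.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="microsoft/Phi-4-mini-instruct",  # assumed model ID
    trust_remote_code=True,
)
print(generator("Write a Python function that reverses a string.",
                max_new_tokens=128)[0]["generated_text"])
```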