Chinese startup DeepSeek made headlines earlier this year when it released its R1 reasoning model, which competed with AI models from American tech giants despite a modest budget. Now, DeepSeek has published a paper in collaboration with researchers at Tsinghua University detailing a new approach to training the reward models used in reinforcement learning, one that can significantly improve their performance, SCMP reported.


According to the paper, the new method aims to align AI models more closely with human preferences by rewarding more accurate and comprehensible answers. Reinforcement learning has proven effective at improving AI performance on narrow, well-defined tasks, but it has been far less effective for general-purpose ones. The DeepSeek team tackles this problem by combining generative reward modeling (GRM) with a technique the paper calls self-principled critique tuning (SPCT). The paper claims that this approach to improving the reasoning capabilities of large language models (LLMs) outperformed existing reward-modeling methods across a range of benchmarks, achieving top results on general queries while consuming fewer computing resources.
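To make the idea concrete, the core loop of a generative reward model can be sketched in a few lines. The snippet below is a minimal illustration, not DeepSeek's implementation: the `generate` function is a hypothetical stand-in for an LLM call, and the prompt wording, scoring scale, and the voting-over-samples step are assumptions made for the demo.

```python
# Minimal sketch of generative reward modeling with principle-guided
# critiques, loosely following the idea described in the article.
# `generate` is a hypothetical stub; a real system would call an LLM.

from collections import Counter


def generate(prompt: str) -> str:
    """Stand-in for an LLM call; returns a canned critique for the demo."""
    # A real GRM would generate principles and a critique, ending in a score.
    return (
        "Principle: answers must be factually grounded.\n"
        "Critique: the response is accurate but omits sources.\n"
        "Score: 8"
    )


def grm_score(query: str, response: str) -> int:
    """Ask the generative reward model to state principles, critique the
    response against them, and emit a scalar score parsed from the text."""
    critique = generate(
        f"Query: {query}\nResponse: {response}\n"
        "First state the principles a good answer should satisfy, "
        "then critique the response and end with 'Score: <1-10>'."
    )
    # Parse the trailing score from the generated critique.
    return int(critique.rsplit("Score:", 1)[1].strip())


def voted_score(query: str, response: str, k: int = 8) -> int:
    """Sample k independent critiques and take the most common score.
    Spending more compute on sampling (larger k) is one assumed way a
    generative reward model can trade compute for reward quality."""
    scores = [grm_score(query, response) for _ in range(k)]
    mode, _ = Counter(scores).most_common(1)[0]
    return mode


if __name__ == "__main__":
    print(voted_score("Explain photosynthesis.",
                      "Plants convert light into chemical energy."))
```

The key design point the sketch captures is that the reward is *generated* as readable text (principles plus a critique) rather than produced as an opaque number, which is what makes the model's judgments more interpretable.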

The new models are called DeepSeek-GRM, short for Generalist Reward Modeling. The company said the new models will be open source but has not announced a release date. Last month, Reuters, citing people familiar with the matter, reported that the company also plans to release DeepSeek-R2, a successor to the R1 reasoning model, in April.

Other leading AI developers, including China’s Alibaba Group Holding and San Francisco-based OpenAI, are also working to improve the reasoning and self-improvement capabilities of AI models, Bloomberg noted.
