According to a new study, large language models (LLMs) can generate research ideas at an expert level, and those ideas were rated as more original and exciting than ideas proposed by human experts. This calls into question the uniqueness of human intelligence in scientific innovation and opens new horizons for the role of AI in the scientific community.
Advances in large language models have sparked a wave of enthusiasm among researchers: AI models such as OpenAI’s ChatGPT and Anthropic’s Claude appear capable of independently generating new scientific hypotheses and, increasingly, of helping to evaluate them. Creating new knowledge and making scientific discoveries were long considered an exclusively human prerogative, in contrast to AI’s mechanical recombination of knowledge from its training data. Yet, having already begun to rival humans in artistic expression, music, and programming, AI has now taken aim at science, showing that it can generate research ideas that are, on average, rated as more novel than those proposed by scientists.
To test this claim, a study was conducted in the field of natural language processing (NLP), the branch of AI concerned with enabling computers to understand and generate human language. NLP covers not only basic syntax but also contextual understanding and, more recently, even verbal tone and the emotional shading of speech. The study involved 100 NLP experts (doctoral-level researchers from 36 different institutions), who entered a kind of scientific competition with LLM-based “ideation agents”. The goal was to find out whose research ideas would be more original, exciting, and feasible.
To ensure the integrity of the experiment, 49 experts formulated ideas on 7 specific NLP topics, while an ideation agent the researchers built on top of an LLM generated ideas on the same topics. To motivate the human participants to produce quality ideas, $300 was paid for each concept an expert proposed, and each of the top five human ideas earned an additional $1,000. Once the ideas had been collected, an LLM was used to standardize the writing style of every submission while preserving its content, to even the odds and keep the comparison as unbiased as possible.
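As a rough illustration, a style-normalization pass like the one described above might look as follows. This is only a minimal sketch assuming an OpenAI-style chat API; the model name, prompt wording, and function are illustrative guesses, not the study’s actual pipeline.

```python
# Minimal sketch of a style-normalization step, assuming an OpenAI-compatible
# chat API. The prompt and model name are illustrative, not the study's setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

STYLE_PROMPT = (
    "Rewrite the following research idea in a neutral, uniform academic style. "
    "Preserve every substantive claim, method detail, and motivation; "
    "change only wording, tone, and formatting."
)

def normalize_style(idea_text: str, model: str = "gpt-4o") -> str:
    """Return a style-normalized version of one submitted idea."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": STYLE_PROMPT},
            {"role": "user", "content": idea_text},
        ],
        temperature=0,  # deterministic rewrites make the blinding easier to audit
    )
    return response.choices[0].message.content
```

The point of such a pass is that reviewers cannot tell human from machine submissions by writing style alone, only by the substance of the idea.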
All submissions were then reviewed blind by 79 external experts. The panel submitted 298 reviews in total, giving each idea two to four independent assessments. The results were striking: AI-generated ideas received statistically significantly higher ratings for novelty and excitement than human ideas. AI ideas scored slightly lower on feasibility and slightly higher on effectiveness, although these differences were not statistically significant.
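To make the phrase “statistically significant” concrete, here is a toy example of the kind of comparison involved: testing whether mean novelty ratings differ between the two groups. The numbers below are invented for demonstration, and the study’s own analysis may use different tests.

```python
# Toy comparison of reviewer ratings for AI vs. human ideas.
# Scores are made up for illustration only.
from scipy import stats

ai_novelty = [6, 7, 5, 8, 7, 6, 7, 8, 5, 6]      # hypothetical 1-10 reviewer scores
human_novelty = [5, 4, 6, 5, 5, 6, 4, 5, 6, 5]   # hypothetical 1-10 reviewer scores

# Welch's t-test does not assume equal variances between the two groups.
t_stat, p_value = stats.ttest_ind(ai_novelty, human_novelty, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # p < 0.05 -> difference is significant
```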
The study also revealed some shortcomings in the AI’s performance, notably a lack of diversity in its ideas: even with explicit instructions not to repeat itself, the agent soon began producing near-duplicates. In addition, the AI could not reliably review and score ideas itself; its judgments showed low agreement with those of human reviewers. The methodology has limitations of its own. Assessing the “originality” of an idea, even by a panel of experts, remains subjective, so a more comprehensive follow-up is planned in which ideas generated by both AI and humans will be developed into full projects, allowing a deeper look at their impact in real-world scenarios. Even so, these first results are certainly impressive.
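The lack-of-diversity problem can be quantified automatically, for example by flagging pairs of generated ideas whose embeddings are almost identical. The sketch below is one possible approach using sentence embeddings; the model name and the 0.8 similarity cutoff are illustrative choices, not the study’s exact setup.

```python
# Flag near-duplicate ideas by cosine similarity of sentence embeddings.
# Model and threshold are illustrative, not the study's configuration.
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

ideas = [
    "Use retrieval-augmented prompting to reduce hallucinations in QA.",
    "Reduce hallucinations in question answering with retrieval-augmented prompts.",
    "Probe multilingual models for gendered translation errors.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(ideas, convert_to_tensor=True)

for i, j in combinations(range(len(ideas)), 2):
    similarity = util.cos_sim(embeddings[i], embeddings[j]).item()
    if similarity > 0.8:  # near-duplicates despite different surface wording
        print(f"Ideas {i} and {j} look like duplicates (cosine = {similarity:.2f})")
```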
Today’s AI models, for all their power, still suffer from unreliability and a tendency to “hallucinate,” which becomes critical in a scientific setting that demands accuracy and dependable information. By some estimates, at least 10% of scientific papers are now co-authored by AI. On the other hand, AI’s potential to accelerate progress in certain areas of human activity should not be underestimated. A striking example is DeepMind’s GNoME system, which in a few months accomplished the equivalent of roughly 800 years of materials-science research, generating the structures of about 380,000 new inorganic crystals that could revolutionize a variety of fields.
AI is now the fastest-growing technology humanity has ever seen, so it is reasonable to expect that many of these shortcomings will be addressed within the next couple of years. Many AI researchers believe that humanity is approaching the birth of general superintelligence, the point at which general-purpose AI surpasses human expertise in virtually every field. The ability of AI to generate more original and exciting ideas than scientists could lead to a rethinking of the process of scientific discovery and of the role humans play in it.