Building and training powerful AI models can cost companies hundreds of millions to billions of dollars a year. For example, OpenAI intends to spend up to $7 billion for these purposes in 2024. The bulk of the costs go to hardware, including expensive NVIDIA accelerators. But as Fortune reports, there is another important expense that is often overlooked: quality data labeling. Yet it is precisely this work that demands increasingly large financial investment.
Labeling (or tagging) is the process of identifying raw data (images, text files, videos, etc.) and adding one or more meaningful, informative labels to provide context, so that an AI model can learn from the data. Data labeling is required for a variety of use cases, including computer vision, natural language processing, and speech recognition.
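To make the idea concrete, here is a minimal sketch of what labeled training records might look like. The schema is hypothetical (real annotation formats such as COCO are far richer); it simply pairs each raw item with the labels an annotator attached.

```python
from collections import Counter

# Hypothetical labeled records: each raw image paired with annotator labels.
records = [
    {"image": "frame_0001.jpg", "labels": ["pedestrian", "stop sign"]},
    {"image": "frame_0002.jpg", "labels": ["truck"]},
    {"image": "frame_0003.jpg", "labels": []},  # nothing of interest in frame
]

# Count how often each label occurs -- a typical sanity check on a dataset.
label_counts = Counter(label for rec in records for label in rec["labels"])
print(label_counts)  # Counter({'pedestrian': 1, 'stop sign': 1, 'truck': 1})
```

Even at this toy scale, the cost driver is visible: every record requires a human judgment, and production datasets contain millions of such records.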
Labeling has long been used, for example, in developing AI models for self-driving cars. A camera captures images of people, street signs, vehicles, and traffic lights, and human annotators mark the images with labels such as “pedestrian,” “truck,” or “stop sign.” This is a labor-intensive, painstaking process that takes a lot of time and requires significant financial investment. Following the release of ChatGPT in 2022, OpenAI was widely criticized for outsourcing such work: the company hired Kenyan workers for less than $2 per hour.
Current general-purpose large language models (LLMs) undergo reinforcement learning from human feedback (RLHF). During this procedure, humans provide qualitative feedback on, or rank, the AI model’s outputs. This approach significantly increases costs. Another driver of rising labeling costs is companies’ desire to include corporate information, such as customer records or internal documents, in the training process.
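A rough sketch of the kind of data RLHF annotation produces is below. The field names are assumptions for illustration: an annotator compares two model responses to the same prompt, and the chosen/rejected pair is later used to train a reward model.

```python
# Hypothetical pairwise-preference records from human annotators.
preference_data = [
    {
        "prompt": "Explain photosynthesis to a 10-year-old.",
        "chosen": "Plants use sunlight to turn water and air into food...",
        "rejected": "Photosynthesis is the process by which autotrophs...",
    },
]

def to_reward_examples(data):
    """Flatten preference records into (text, score) examples, scoring the
    human-preferred response above the rejected one."""
    examples = []
    for rec in data:
        examples.append((rec["prompt"] + " " + rec["chosen"], 1.0))
        examples.append((rec["prompt"] + " " + rec["rejected"], 0.0))
    return examples

print(len(to_reward_examples(preference_data)))  # 2
```

Each such comparison requires a human judgment call, which is why RLHF multiplies labeling costs rather than reducing them.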
In addition, labeling expert-level data in areas such as law, finance, and healthcare requires highly qualified specialists, whose services are expensive. That is why some developers outsource data labeling tasks to third-party companies such as Scale AI, which recently received $1 billion in funding.
Alex Ratner, CEO of data labeling startup Snorkel AI, says enterprise clients can spend millions of dollars labeling and preparing data. In some cases these operations consume up to 80% of an AI project’s time and budget. Moreover, to stay relevant over time, the data must be periodically supplemented and reprocessed.
Thus, labeling, along with the need for expensive hardware, has become one of the main cost items in training AI models. Some companies reduce costs by using synthetic data, that is, data generated by AI itself. Recent innovations have made synthetic data generation fast and efficient, which in some cases makes it possible to forgo arrays of real-world data. However, this approach risks “self-repetition,” in which models trained on their own output degrade over time.
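One cheap stand-in for human annotation, shown here as a toy sketch, is pseudo-labeling: a model labels unlabeled data itself, and only confident predictions join the training pool. The function and stub model below are hypothetical; the confidence threshold limits, but does not eliminate, the self-repetition risk described above.

```python
def pseudo_label(model_predict, unlabeled, threshold=0.9):
    """Keep only the items whose predicted label the model is confident
    about; these become cheap 'synthetic' training labels."""
    labeled = []
    for item in unlabeled:
        label, confidence = model_predict(item)
        if confidence >= threshold:
            labeled.append((item, label))
    return labeled

# Stub model standing in for a real classifier (assumed for illustration).
preds = {"img1": ("cat", 0.95), "img2": ("cat", 0.60)}
result = pseudo_label(lambda x: preds[x], ["img1", "img2"])
print(result)  # [('img1', 'cat')]
```

The trade-off mirrors the article's point: labels now cost compute instead of human time, but any systematic model error is fed straight back into training.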