Not just NVIDIA accelerators: data labeling is becoming one of the main drivers of rising AI model costs

Building and training powerful AI models can cost companies hundreds of millions to billions of dollars a year. For example, OpenAI intends to spend up to $7 billion on this in 2024. The bulk of the cost goes to hardware, including expensive NVIDIA accelerators. But as Fortune reports, there is another important expense that is often overlooked: the need for high-quality data labeling. And this work demands ever larger financial investment.

Data labeling (or annotation) is the process of identifying raw data (images, text files, videos, etc.) and adding one or more meaningful, informative labels to provide context. This is necessary so that an AI model can learn from that information. Data labeling is required for a variety of use cases, including computer vision, natural language processing, and speech recognition.
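As a rough illustration of what a labeled dataset looks like in practice, here is a minimal Python sketch. The `LabeledExample` type, the `annotate` and `label_counts` helpers, the file names and the label strings are all hypothetical and chosen only for illustration; real labeling tools use richer formats (bounding boxes, text spans, timestamps).

```python
from dataclasses import dataclass, field


@dataclass
class LabeledExample:
    """One raw item plus the labels a human annotator attached to it."""
    raw: str                                   # hypothetical file name or text
    labels: list = field(default_factory=list)


def annotate(raw, labels):
    """Attach labels to a raw item (stands in for the human annotator's work)."""
    return LabeledExample(raw=raw, labels=list(labels))


def label_counts(examples):
    """Count how often each label occurs across the dataset."""
    counts = {}
    for example in examples:
        for label in example.labels:
            counts[label] = counts.get(label, 0) + 1
    return counts


# Hypothetical data: two camera frames with human-assigned labels.
examples = [
    annotate("frame_0001.jpg", ["pedestrian", "stop sign"]),
    annotate("frame_0002.jpg", ["truck"]),
]
```

Every entry in `examples` represents a decision made by a person, which is why the cost of labeling grows with the size of the dataset.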

Labeling has long been used, for example, in developing AI models for self-driving cars. A camera captures images of people, street signs, vehicles and traffic lights, and human annotators label the images with tags such as “pedestrian,” “truck,” or “stop sign.” This is a labor-intensive, painstaking process that takes a lot of time and requires significant financial investment. Following the release of ChatGPT in 2022, OpenAI was widely criticized for outsourcing such work: the company used Kenyan workers paid less than $2 an hour.

Current general-purpose large language models (LLMs) undergo reinforcement learning from human feedback (RLHF). In this procedure, humans provide qualitative feedback on, or rank, what the AI model generates. This approach significantly increases costs. Another reason for the rising cost of data labeling is companies' desire to include corporate information, such as customer data or internal documents, in the training process.
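To give a sense of how human rankings become training data, the sketch below shows one common way of turning a single ranking into pairwise preferences. It is an illustration under stated assumptions, not a description of any particular company's pipeline; the function name `ranking_to_pairs` and the sample responses are hypothetical.

```python
from itertools import combinations


def ranking_to_pairs(ranked_responses):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs.

    combinations() keeps the input order, so the first element of each
    pair is always the response the annotator ranked higher.
    """
    return list(combinations(ranked_responses, 2))


# Hypothetical ranking of three model responses, best first.
pairs = ranking_to_pairs(["response A", "response B", "response C"])
for preferred, rejected in pairs:
    print(f"{preferred} preferred over {rejected}")
```

A ranking of n responses yields n(n-1)/2 comparisons, but each ranking still requires a person to read and judge every response, which is where the cost comes from.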

In addition, labeling expert-level data in areas such as law, finance and healthcare requires highly qualified specialists, who command high pay. That is why some developers outsource data labeling to third-party companies such as Scale AI, which recently received $1 billion in funding.

Alex Ratner, CEO of data labeling startup Snorkel AI, says enterprise clients can spend millions of dollars labeling and processing information. In some cases such work consumes up to 80% of an AI project's time and budget. Moreover, to stay relevant over time, the data must be periodically supplemented and reprocessed.

Thus labeling, along with the need for expensive hardware, is becoming one of the main cost items in training AI models. Some companies reduce costs by using synthetic data, that is, data generated by AI itself. Recent innovations in AI have made generating synthetic data fast and efficient, which in some cases makes it possible to do without large sets of real data. However, in some cases this carries a risk of “self-repetition,” with models learning from their own output.
