Meta CEO Mark Zuckerberg personally authorized the Meta division that develops the Llama artificial intelligence models to train them on a dataset containing illegally obtained books and articles. This came to light in documents published as part of writer Richard Kadrey's lawsuit against Meta.
The case is one of many in which tech giants developing AI systems are accused of training models on copyrighted material without the authors' permission. Defendants have typically argued that their actions qualify as fair use, a doctrine that permits the use of copyrighted material without permission to create new works and products that differ substantially from the original. Many copyright holders disagree with this position.
A new batch of unsealed documents (PDF) contains testimony from Meta representatives, according to which Mark Zuckerberg personally approved the company's use of the LibGen dataset to train Llama. LibGen, which bills itself as a link aggregator, in fact provides access to copyrighted works owned by major publishers. The project has been sued repeatedly, fined tens of millions of dollars for copyright infringement, and ordered to shut down. According to the documents, Zuckerberg approved the use of LibGen to train at least one Llama model despite concerns raised by Meta employees and management. The filing cites an internal memo noting that the use of LibGen was approved after "escalation to MZ," an abbreviation that apparently refers to the head of the company.
On January 8, the plaintiffs filed a statement with the court containing new allegations. In particular, they allege that Meta may have tried to conceal its use of LibGen materials: Meta engineer Nikolay Bashlykov allegedly wrote a script that removed copyright information from books in the training dataset. Meta also allegedly stripped copyright notices and related metadata from scientific journal articles in the dataset. In addition, the plaintiffs claim that Meta infringed copyright by downloading the LibGen dataset over the BitTorrent protocol: in doing so, the company not only downloaded the data but simultaneously "distributed" it, in effect sharing pirated material. According to the filing, Meta's head of generative AI, Ahmad Al-Dahle, gave permission to download the LibGen data via BitTorrent, even though Bashlykov had warned that this "may not be legally permissible."
The case is still far from over. For now, it concerns only early Llama models, not the latest releases. And if Meta convinces the court that its use of the materials was fair use, the court may side with the company: in 2023, several plaintiffs were unable to prove copyright infringement, and their claims against Meta were dismissed.