The Wikimedia Foundation (the non-profit that runs Wikipedia) has proposed that companies stop scraping Wikipedia content with bots, a practice that drains its resources and overloads its servers with traffic, and instead use a dataset specifically optimized for training AI models.
Wikimedia has announced a partnership with Kaggle, a leading data science and machine learning platform owned by Google, to publish a beta version of a dataset of “structured Wikipedia content in English and French.”
According to Wikimedia, the dataset hosted on Kaggle was “designed with machine learning workflows in mind,” making it easy for AI developers to access machine-readable article data for modeling, fine-tuning, benchmarking, alignment, and analysis. The dataset’s contents are openly licensed. As of April 15, it includes article abstracts, short descriptions, image links, infobox data, and article sections, but excludes references and other non-written elements such as audio files.
As Wikimedia reports, the “well-structured JSON representations of Wikipedia content” available to Kaggle users should be a more attractive alternative to “scraping or analyzing the raw text of articles.”
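For readers who want to try the beta, here is a minimal Python sketch of what consuming those JSON representations might look like. The files can be fetched from the dataset’s Kaggle page (or via the `kaggle datasets download` CLI); the file name and field names below are illustrative assumptions, since the announcement does not spell out the exact schema.

```python
import json

# Minimal sketch: iterate over structured Wikipedia article records.
# Assumption: the data ships as JSON Lines, one article object per line;
# the file name and keys ("name", "abstract", "sections") are placeholders,
# not the documented schema -- check the Kaggle dataset page for the real one.
PATH = "enwiki_structured_contents.jsonl"

with open(PATH, encoding="utf-8") as f:
    for line in f:
        article = json.loads(line)
        # Pull a few of the structured pieces the announcement describes:
        # an abstract plus clearly segmented article sections.
        title = article.get("name")
        abstract = article.get("abstract") or ""
        n_sections = len(article.get("sections", []))
        print(f"{title}: {abstract[:80]} ({n_sections} sections)")
        break  # inspect just the first record
```

Because the records arrive already parsed into fields, a workflow like this replaces the HTML scraping and wikitext parsing that bulk crawlers currently run against Wikipedia’s live servers.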
Wikimedia currently has content-sharing agreements with Google and the Internet Archive, but the partnership with Kaggle will make the data more accessible to smaller companies and independent data scientists. “As a go-to place for the machine learning community to get tools and benchmarks, Kaggle is excited to host the Wikimedia Foundation’s data,” said Brenda Flynn, partnerships lead at Kaggle.