You can train AI models without copyrighted material

AI companies claim their tools couldn't exist without training on copyrighted material. It turns out they could; it's just really hard. To prove it, AI researchers trained a new model that is less powerful but much more ethical. That's because the LLM's dataset uses only public domain and openly licensed material.
The paper (via The Washington Post) was a collaboration between 14 different institutions. The authors represent universities such as MIT, Carnegie Mellon and the University of Toronto. Nonprofits like the Vector Institute and the Allen Institute for AI also contributed.
The group built an 8TB ethically sourced dataset. The data includes a set of 130,000 books from the Library of Congress. After inputting the material, they trained a seven-billion-parameter large language model (LLM) on the data. The result? It performed about as well as Meta's Llama 2-7B from 2023. The team did not publish benchmarks comparing its results with today's top models.
Performance comparable to a two-year-old model was not the only downside. The process of putting it all together was also a grind. Much of the data could not be read by machines, so humans had to sift through it. “We use automated tools, but all of our stuff was manually annotated at the end of the day and checked by people,” co-author Stella Biderman told The Washington Post. “And that's just really hard.” Figuring out the legal details also made the process difficult. The team had to determine which license applied to each website they scanned.
So, what do you do with a less powerful LLM that is much harder to train? If nothing else, it can serve as a counterpoint.
In 2024, OpenAI told a British parliamentary committee that such a model essentially could not exist. “It would be impossible to train today’s leading AI models without using copyrighted materials,” the company said. Last year, an Anthropic expert witness added, “LLMs would likely not exist if AI firms were required to license the works in their training datasets.”
Of course, this study won't change the trajectory of AI companies. After all, doing more work to create less powerful tools does not align with their interests. But at least it punctures one of the industry’s common arguments. Don’t be surprised if you hear about this study again in legal cases and regulatory arguments.