Ethical AI Training: New Model Developed Using Public Domain Data

AI companies claim their tools couldn’t exist without training on copyrighted material. It turns out they could; it’s just really hard. To prove it, AI researchers trained a new model that is less powerful but much more ethical, using a dataset composed solely of public domain and openly licensed material.

Collaboration Among Institutions

The paper (via The Washington Post) was a collaboration involving 14 different institutions. The authors represent prestigious universities such as MIT, Carnegie Mellon, and the University of Toronto. Additionally, nonprofits like the Vector Institute and the Allen Institute for AI contributed to this significant project.

Building an Ethically Sourced Dataset

The group built an 8 TB ethically sourced dataset, which included a collection of 130,000 books from the Library of Congress. They then trained a seven-billion-parameter large language model (LLM) on it. The result? Performance comparable to Meta’s similarly sized Llama 2-7B from 2023. However, the team did not publish benchmarks comparing its results to today’s leading models.

Challenges in Training the Model

Performance comparable to a two-year-old model was not the only downside. The process of assembling the dataset was also labor-intensive. Much of the data was not machine-readable, necessitating human intervention to sift through it. “We use automated tools, but all of our stuff was manually annotated at the end of the day and checked by people,” co-author Stella Biderman told WaPo. “And that’s just really hard.” Additionally, navigating the legal complexities of the data proved challenging, as the team had to determine which license applied to each website they scanned.

Implications of a Less Powerful LLM

So, what do you do with a less powerful LLM that is much harder to train? If nothing else, it can serve as a counterpoint to the prevailing narrative in the AI industry.

Industry Perspectives on Copyrighted Materials

In 2024, OpenAI told a British parliamentary committee that such a model essentially couldn’t exist, asserting that it would be “impossible to train today’s leading AI models without using copyrighted materials.” Last year, an expert witness for Anthropic added that “LLMs would likely not exist if AI firms were required to license the works in their training datasets.”

Future of AI Training Practices

While this study may not significantly alter the trajectory of AI companies, it does puncture one of the industry’s common arguments: the notion that ethical AI training is infeasible. The industry may continue to prioritize efficiency over ethical considerations, but don’t be surprised if you hear about this study again in courtrooms and regulatory debates.

This article originally appeared on Engadget.
