By Gary Symons
TLL Editor in Chief
(Disclosure: The writer of this article was formerly a member of the Investigative Unit at CBC News, ending in 2008)
An investigation by CBC News has found that the works of Canadian authors were shared online as part of a dataset used to train artificial intelligence software.
The Canadian broadcaster found that at least 2,500 copyrighted books written by more than 1,200 Canadian authors were used in this way, which many authors and lawyers say is a potential violation of their copyright protection.
The existence of the dataset was actually revealed earlier this year in The Atlantic magazine, which created a wave of indignant protest by writers concerned the dataset could be used to create new work in their distinct writing style.
CBC News dug deeper into the dataset, known as Books3, to find out which Canadian authors and books were included. According to CBC, more than 1,200 of the country’s top authors were affected, including more than a third of all Governor General’s Literary Award finalists, several nominees for the prestigious Giller Prize, as well as three-quarters of the authors featured in CBC’s own Canada Reads competition.
Globally, The Atlantic and CBC found that roughly 190,000 files from authors around the world were included in the Books3 database.
The list of authors was topped by Margaret Atwood, arguably Canada’s most famous writer, whose novel The Handmaid’s Tale became a popular and critically acclaimed TV series. Forty-one of Atwood’s books were found in the dataset, along with several books by children’s author Gordon Korman and Nobel Prize winner Alice Munro.
The Books3 dataset was created by a non-profit artificial intelligence research lab called EleutherAI, which released it so AI developers could access “OpenAI-grade training data at your fingertips.”
While EleutherAI does face at least one lawsuit, it does appear from the organization’s website that the data is not being used for its own commercial purposes, but to aid in the research and development of language-based training models for AI programs. When The Atlantic published its original story on the existence of the dataset, EleutherAI developer Shawn Presser defiantly argued that his work was meant to advance science, and allow others to develop AI programs. “I would gladly go to prison … for advancing science and giving you the power to replicate ChatGPT,” Presser said in a tweet at the time.
While Presser says there isn’t really a copyright issue with the dataset, that’s not the view of the Writers’ Union of Canada (WUC), which represents more than 2,600 professional writers. WUC’s ire was raised after CBC News discovered that one-sixth of its members had at least one book included in the dataset.
“It’s huge. It’s incredibly impactful on the cultural economy in a negative way. And, as importantly, it’s unbelievably disrespectful,” said John Degen, WUC’s executive director. “No one asked for permission. No one explained the project. To me, that’s inexcusable and needs to be addressed legally and by Parliament.”
Degen argues the unauthorized inclusion of these works in the Books3 dataset is a violation of Canadian copyright law. “Copyright can be very abstract and hard to understand, but I don’t think that taking a pirated book from a pirate site and using it for your own industrial purposes, I don’t think that it’s hard to understand that that’s wrong,” said Degen, adding the organization may file its own lawsuit.
However, the outcome of any lawsuit is far from clear. CBC interviewed Osgoode Hall Law School professor Carys Craig, who specializes in intellectual property law and technology and who says it is debatable whether the existence or use of Books3 is illegal under Canada’s copyright law.
“It’s not clear that the inclusion of works in a dataset used to train a generative AI model does constitute copyright infringement,” said Craig. “Even if it’s done without the consent of the rights holder, it’s not clear that it implicates copyright at all.”
In fact, Craig wants the Canadian government to make it easier under copyright law for AI researchers to use text and data without fear of being sued, in order to speed the country’s development of AI models.
“It’s simply unrealistic to imagine that permissions are going to be sought from every individual author whose work appears there,” she argues.
While this is likely the first major conflict over AI in Canada, there are multiple lawsuits ongoing in the United States and globally, and the Books3 database is mentioned in a number of those cases. The Authors Guild, for example, has filed a class action suit, while individual writers have filed their own specific cases.
One of those involves the author, actor and comedian Sarah Silverman, who says copyright infringement by AI bots is no laughing matter.
Silverman is one of three people filing lawsuits charging that AI bots were trained on their work without their permission, resulting in AI replicating their work. The other two plaintiffs are authors Christopher Golden and Richard Kadrey.
The basis for their claim is similar to those made recently by multiple artists. Tools like the popular ChatGPT are built on large language models, which are fed huge amounts of data taken from the internet to train them to give convincing responses to text prompts from users. In the case of ChatGPT, that data would include books, articles, essays, and other types of text documents originally produced by human beings.
TLL does not use AI bots to produce articles, but when The Licensing Letter was testing ChatGPT’s capabilities, we asked the AI bot to write a short press release in the writing style of children’s writer Dr. Seuss, which it did in under 30 seconds. That would not be possible unless the AI bot was previously trained on written material from the writer.
The lawsuit against OpenAI claims the authors “did not consent to the use of their copyrighted books as training material for ChatGPT. Nonetheless, their copyrighted materials were ingested and used to train ChatGPT.”
Similarly, the lawsuit against Meta claims that the authors’ books appear in the vast dataset used to train Meta’s LLaMA, a family of AI models.
Both suits also claim the authors’ works were obtained without permission from what they call ‘shadow libraries,’ which have been widely used by the AI research community to develop the language skills of their AI bots.
So, how did the plaintiffs discover their works had been used? Ironically, by using the AI bots themselves.
The OpenAI suit includes exhibits of evidence claiming the bots were prompted to produce summaries of three books and successfully did so: Silverman’s The Bedwetter, Golden’s Ararat, and Kadrey’s Sandman Slim. The Meta suit likewise cites multiple works by the authors being used by its bots, as well as a research paper from Meta AI called “Open and Efficient Foundation Language Models” that showed the system’s training datasets included material taken from the shadow libraries in a manner the claimants described as “flagrantly illegal.”
The lawyers representing the three plaintiffs, Joseph Saveri and Matthew Butterick, say they have been hearing from several writers and publishers who are concerned about the ability of AI bots to generate text that is extremely similar to their copyrighted works.
Why You Need to Watch the Getty Images Lawsuit Against Stability AI
At the same time, as previously reported in TLL, the stock photo company Getty Images is suing Stability AI, the company behind the generative image model Stable Diffusion, in the U.S. and the UK over copyright infringement, alleging the model was trained using images drawn from Getty’s massive image library. Just this week, a UK court ruled the lawsuit can go to trial, finding merit in Getty’s claim that its copyrighted material was used to train AI models.
Stability AI argued the case should not be heard in a UK court because, it said, no one involved in the training or development of Stable Diffusion was based in the UK. The company also said it trained the model using US-based cloud computing services from AWS.
However, Justice Joanna Smith found that statements from the company and Stability CEO Emad Mostaque “raise the specter that evidence is either inaccurate or incomplete; at the very least [they] suggest a conflict of evidence.”
On the other hand, the creators of AI have also won victories in the courts. Last month a judge in California initially dismissed five of the six claims in a lawsuit concerning LLaMA, the family of AI models from Facebook parent Meta. The ruling states that, based on the current allegations, the models do not constitute “a recasting or adaptation of any of the plaintiffs’ books,” but are instead used only for the purpose of teaching an AI how to better use language.
Back in Canada, the federal government has launched its second consultation on the implications of generative AI for copyright, and will consider if copyright laws should be changed.