Last week, two authors sued OpenAI for training ChatGPT on their copyrighted books without permission. Now comedian and actor Sarah Silverman has joined the fray with a similar lawsuit.
Silverman and authors Richard Kadrey and Christopher Golden have filed class action lawsuits in San Francisco federal court against Facebook parent company Meta and against OpenAI, the Microsoft-funded company behind ChatGPT. Their accusation matches that of the lawsuit filed last week, which alleges ChatGPT was trained on copyrighted books – not only on public information like Wikipedia pages and marketing materials, as OpenAI claims.
From a Reuters report:
“Silverman, Kadrey and Golden allege Meta and OpenAI used their books without authorization to develop their so-called large language models, which their makers pitch as powerful tools for automating tasks by replicating human conversation.
In their lawsuit against Meta, the plaintiffs allege that leaked information about the company’s artificial intelligence business shows their work was used without permission.
The lawsuit against OpenAI alleges that summaries of the plaintiffs’ work generated by ChatGPT indicate the bot was trained on their copyrighted content.
“The summaries get some details wrong” but still show that ChatGPT “retains knowledge of particular works in the training dataset”.
Last week, Paul Tremblay, the author of “The Cabin at the End of the World,” and Mona Awad, the author of “13 Ways of Looking at a Fat Girl,” filed a lawsuit against OpenAI, saying that ChatGPT can summarize their books very accurately, which would only be possible if the model had actually been trained on the books themselves, which are under copyright.
In their complaint, the authors allege that “their copyrighted materials were ingested and used to train ChatGPT”. They claim that the company has been open about using copyrighted books as training material, not just open source data, and that it actually used a dataset based on pirated books.
From their lawsuit:
“For instance, in its June 2018 paper introducing GPT-1 (called “Improving Language Understanding by Generative Pre-Training”), OpenAI revealed that it trained GPT-1 on BookCorpus, a collection of “over 7,000 unique unpublished books from a variety of genres including Adventure, Fantasy, and Romance.” OpenAI confirmed why a dataset of books was so valuable:
“Crucially, it contains long stretches of contiguous text, which allows the generative model to learn to condition on long-range information.” Hundreds of large language models have been trained on BookCorpus, including those made by OpenAI, Google, Amazon, and others.
BookCorpus, however, is a controversial dataset. It was assembled in 2015 by a team of AI researchers for the purpose of training language models. They copied the books from a website called Smashwords.com that hosts unpublished novels that are available to readers at no cost.
Those novels, however, are largely under copyright. They were copied into the BookCorpus dataset without consent, credit, or compensation to the authors.”
So far, OpenAI has not released a statement in response.
Lead photo by Gage Skidmore via Wikimedia Commons.