As of: February 23, 2024, 10:13 a.m.
By: Sven Trautwein
Thousands of books served as training text for the software behind tools such as ChatGPT, above all titles from Penguin Random House.
Prominent authors, including Margaret Atwood, Stephen King and Sarah Silverman, have taken legal action in recent months against the use of their texts as the basis for software models such as those behind ChatGPT.
On the American side, around 8,000 authors joined them.
But which publishers does this affect?
A search of the Books3 data set, which serves as a basis for Meta's LLaMA and OpenAI's ChatGPT, shows that certain publishers top the rankings.
More than 70,000 e-books searched
More than 70,000 e-books were used without permission to feed texts to language models for artificial intelligence.
Peter Schoppert, managing director of NUS Press, has taken a closer look at the data sets.
With further help, he focused on around 72,000 e-books, which were searched by author name, publisher name and ISBN.
According to the online magazine AI and Copyright, English-language e-books served as the basis.
According to Schoppert, the evaluation produced an interesting picture.
Penguin Random House and HarperCollins in first and second place
The publisher with the largest number of e-book titles in this filtered list is Penguin Publishing Group with 6,866 ISBNs, followed by HarperCollins with around 5,800 titles and Random House Publishing with around 3,400 ISBNs.
According to Schoppert, university publishers have not been spared either.
Columbia University Press appears on the list with 899 titles, ahead of Yale University Press with 554 and Princeton University Press with 376 titles.
According to Schoppert, this shows that the assumption that the texts used to train the software were mainly Wikipedia and Reddit entries, as well as millions of words from the Internet, is wrong.
More than 72,000 illegal e-books
More than 72,000 illegally copied e-books used to train large language models (LLMs) were found.
Copyright fell by the wayside in the process.
Recently, horror writer Stephen King addressed readers in an article in The Atlantic, saying that he had not given permission for his texts to be used.
The Authors Guild, America's oldest and largest professional organization for writers, recently updated its model publishing contract for authors.
A new clause now prohibits using these texts to train the software.
But whether AI companies will adhere to this remains an open question, according to AI and Copyright.
In the past, they have also made use of pirated content.
Recently, authors including Stephen King have achieved partial success.
A small database called “Prosecraft” has been taken offline.