From Harry Potter to 1984, by way of Gone with the Wind and Beloved by Toni Morrison, the Nobel laureate in literature: some 50 books, including many science fiction and fantasy classics such as Harry Potter, were used to train the AI model behind ChatGPT. The finding comes from researcher David Bamman of the University of California, Berkeley, who worked with a team of colleagues.
Bamman's discovery was accidental. The researcher usually uses technology to build "algorithmic measuring devices for culture", which in practice means extracting data from classic literature on topics such as the relationships between the characters of a novel. In this case, he was working on Jane Austen's Pride and Prejudice when he decided to put his questions to ChatGPT. He found that the software answered as accurately as if it had read the book, but there was no way to tell how the chatbot knew what it knew, because the inner workings of large language models are a black box. So Bamman and his team decided to become "data archaeologists." They quizzed ChatGPT on its knowledge of various books and assigned each a score. The higher the score, the more likely it was that the book was part of the software's training data. They then compiled their findings in a paper, which was reported by Business Insider.
The list of 50 novels that helped train ChatGPT, a small part of the chatbot's immense training data, includes classics like Moby Dick, The Scarlet Letter, The Color Purple, The Remains of the Day, and The Grapes of Wrath. But the books the AI model knows best are science fiction and fantasy. At the top of the list are Harry Potter and the Sorcerer's Stone by J.K. Rowling and 1984 by George Orwell. They are followed by landmark works such as The Lord of the Rings, Fahrenheit 451, and Brave New World, as well as Neuromancer by William Gibson and Do Androids Dream of Electric Sheep? by Philip K. Dick, two authors who, ironically, were among the first to sound the alarm about artificial intelligence. The list goes on: A Game of Thrones, The Hitchhiker's Guide to the Galaxy, The Da Vinci Code. The books absorbed by ChatGPT also include a couple of Ian Fleming's James Bond novels, while the texts ChatGPT knows less well include The Shining and Bridget Jones's Diary.
"The sources on which these AI models have been trained will influence the type of models themselves and the values they present," notes Bamman who at the same time asks: "What happens when a bot devours narrative about all kinds of dark and dystopian worlds? How can this genre influence the behavior of these models in ways that don't involve literary or narrative things? There is a lot of work to be done in this regard. We do not yet have the answer to this question", concludes the researcher.
There are known risks of misinformation and of biased or distorted information, and they can only be addressed once the developers of artificial intelligence software open up their datasets. That is not currently the case with OpenAI, the company that launched ChatGPT, even though its chief executive Sam Altman has, at least publicly, asked the US Congress to regulate its activities.