The mathematics that produce and detect 'fake news'

2020-08-18


Automated text generation models use big data techniques to extract the most typical patterns of natural language and obtain (or identify) human-like results


Keyboard with a fake news folder. dimarik / Getty Images / iStockphoto

A few months ago, the Avaaz platform published a report warning of the massive presence of undetected false news about COVID-19 on social networks. These contents, amplified by the media that make them go viral, are fueling another pandemic, which the WHO has called an "infodemic", capable of causing all kinds of misunderstandings and deceptions about the virus. Moreover, part of this news, despite its human appearance, is mass-produced using mathematical text-generation models based on artificial neural networks. However, the same ideas and mathematical models can also be used in the opposite direction, and they are key to projects for detecting false content.

The problem of automatic text generation –that is, getting computers to speak or write coherently in natural languages such as English or Spanish– is tied to the origins of computer science, since it allows the machine and the human user to communicate easily. The first systems –such as the ELIZA chatbot (created in 1964), which emulated a psychotherapist, or the Racter software (1984), which produced one of the first novels written (almost) without human intervention– generated their sentences by applying sets of rules called formal grammars.

The results, despite notable advances in this field over the decades, were unconvincing. Achieving better ones required a paradigm shift in natural language processing, which arrived with the turn of the century and the rise of big data. Instead of requiring grammar rules entered by hand, the new models process huge amounts of text with big data techniques to learn the linguistic patterns on their own. Thus machines, although they do not understand language, are capable of reproducing the most typical patterns that appear in natural languages.

According to the so-called distributional hypothesis, popularized by the linguist John Rupert Firth in the 1950s, the meaning of a word is given by the other words that usually accompany it (its neighbors)

To do this, these systems start from the so-called distributional hypothesis, popularized by the linguist John Rupert Firth in the 1950s, according to which the meaning of a word is given by the other words that usually accompany it (its neighbors). Imagine, for example, that we want a machine to extract the meaning of the word "dog" by studying the presence on the Internet of three phrases: "dogs have muzzles", "dogs bark" and "dogs sew scarves". To do this, it could consider all the text available on the Internet (in Spanish) and see which of these phrases appear more frequently. Surely the first two are much more common than the third; that is, the word "dog" is usually accompanied by "muzzle" and "bark", and not by "sew". Applying the distributional hypothesis, a dog will therefore be "something" that has a muzzle and barks, but does not sew.
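
To make the idea tangible, here is a minimal Python sketch (not the authors' code) that counts which words co-occur with a target word over a toy corpus; the tiny corpus and the sentence-level window are assumptions made purely for illustration.

```python
from collections import Counter

# Toy corpus standing in for "all the text on the Internet" (an assumption for the example).
corpus = [
    "dogs have muzzles",
    "dogs bark at strangers",
    "dogs bark loudly",
    "tailors sew scarves",
]

def neighbor_counts(target: str, sentences: list[str]) -> Counter:
    """Count the words that appear in the same sentence as `target`."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        if target in words:
            counts.update(w for w in words if w != target)
    return counts

# The most frequent neighbors of "dogs" approximate its distributional meaning.
print(neighbor_counts("dogs", corpus).most_common(3))
# e.g. [('bark', 2), ('have', 1), ('muzzles', 1)] -> dogs are things that bark and have muzzles
```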

This is how language models (LMs) work, and this is how they learn the meanings of words, which are nothing more than frequent patterns in all the natural text considered by the machine. LMs are the basic components of current text generation systems, which generate sentences by predicting the next word given a series of previous words, using ideas from probability and statistics. In the example above, the model will predict that after "the dog", the word "barks" is more likely to appear than the word "sews".
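
A very simple way to see this prediction step is a bigram model, sketched below in Python. This is an illustrative simplification (real systems use neural networks trained on billions of words), and the miniature corpus is an assumption for the example.

```python
from collections import Counter, defaultdict

# Tiny training corpus; real language models learn from billions of words.
corpus = "the dog barks . the dog barks . the dog sews".split()

# Count how often each word follows each previous word (bigram counts).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def next_word_probability(prev: str, word: str) -> float:
    """P(word | prev) estimated from the bigram counts."""
    counts = following[prev]
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

# After "dog", "barks" is more probable than "sews".
print(next_word_probability("dog", "barks"))  # 2/3, about 0.67
print(next_word_probability("dog", "sews"))   # 1/3, about 0.33
```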

Mathematically, these systems represent each word as a vector, the so-called word embedding, of about 300 dimensions. The most widely used system for doing this is called word2vec. In this geometric space, similar words are close to each other (thus, "dog" would be closer to "bark" than to "sew"), operations can be performed between vectors, and new ones can be generated. Among the most powerful models to date are GPT-2 and its successor GPT-3, from the company OpenAI, which generate texts of surprising quality. So much so that in 2019 OpenAI delayed releasing the full model for fear of misuse. Despite this precaution, the use of models of this type for text generation is now widespread and not easy to detect. We suggest that readers try to guess, from these reviews of music products, which ones are legitimate and which have been generated by a model similar to OpenAI's. Hint: half are of one type and half of the other.
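
To make the geometry concrete, the following sketch measures closeness between word vectors with cosine similarity. The 4-dimensional vectors are made up for illustration; real word2vec embeddings have about 300 dimensions and are learned from huge corpora.

```python
import numpy as np

# Made-up low-dimensional embeddings (an assumption for the example).
embeddings = {
    "dog":  np.array([0.9, 0.8, 0.1, 0.0]),
    "bark": np.array([0.8, 0.9, 0.2, 0.1]),
    "sew":  np.array([0.1, 0.0, 0.9, 0.8]),
}

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors: close to 1 means similar, close to 0 unrelated."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(embeddings["dog"], embeddings["bark"]))  # close to 1: similar words
print(cosine_similarity(embeddings["dog"], embeddings["sew"]))   # much lower: dissimilar words
```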

New models like GLTR try to identify even the most sophisticated automatic texts. They use mathematical tools that color-code words according to how likely they are

Against this, new models like GLTR try to identify even the most sophisticated automatic texts. They use mathematical tools similar to the previous ones, which color-code the words according to how likely they are in that context, for that model: green if they are among the 10 most plausible, yellow if among the top 100, red if among the top 1,000, and purple otherwise. To evaluate whether a text is fake, the model counts the number of words of each color: if the number of green words is very high, it is very likely that the text has been generated by a machine; if, on the contrary, most of the words are the less likely red, yellow or purple ones, it was probably written by a human.
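
As a rough sketch of the idea behind this kind of detector (not the actual GLTR code), the snippet below uses the Hugging Face transformers library and the small public GPT-2 model, assumed here as the scoring language model, to compute the rank the model assigns to each token of a text and bucket those ranks into the green/yellow/red/purple bins described above.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Small public GPT-2 as the scoring language model (an assumption; other models could be used).
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def rank_buckets(text: str) -> dict:
    """Count how many tokens fall in the top-10 / top-100 / top-1000 / rest of the model's predictions."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits  # shape: (1, sequence_length, vocabulary_size)
    buckets = {"green (top 10)": 0, "yellow (top 100)": 0, "red (top 1000)": 0, "purple (rest)": 0}
    for pos in range(1, ids.shape[1]):
        probs = torch.softmax(logits[0, pos - 1], dim=-1)           # prediction for the token at `pos`
        rank = int((probs > probs[ids[0, pos]]).sum().item()) + 1   # 1 = the model's most likely token
        if rank <= 10:
            buckets["green (top 10)"] += 1
        elif rank <= 100:
            buckets["yellow (top 100)"] += 1
        elif rank <= 1000:
            buckets["red (top 1000)"] += 1
        else:
            buckets["purple (rest)"] += 1
    return buckets

# A text dominated by green tokens is suspiciously "easy" for the model: a hint that a machine wrote it.
print(rank_buckets("The dog barks at the mailman every single morning."))
```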

According to recent results, the benefit of this tool is considerable: without it, human evaluators distinguish news generated by humans from news generated by machines with 54.2% accuracy; with it, the rate rises to 72.3%. However, by the time this article is published these figures will probably have changed already: in the context of the infodemic, we are living through an accelerated arms race to design, on the one hand, the best generative text models and, on the other, the corresponding detectors.

Víctor Gallego and Alberto Redondo are predoctoral researchers at ICMAT. Ágata Timón G. Longoria is responsible for communication and outreach at ICMAT.

Café y Teoremas is a section dedicated to mathematics and the environment in which it is created, coordinated by the Institute of Mathematical Sciences (ICMAT), in which researchers and members of the center describe the latest advances in this discipline, share meeting points between mathematics and other social and cultural expressions, and remember those who marked its development and knew how to transform coffee into theorems. The name evokes the definition by the Hungarian mathematician Alfréd Rényi: "A mathematician is a machine that transforms coffee into theorems."

Editing and coordination: Ágata A. Timón García-Longoria (ICMAT)

You can follow MATERIA on Facebook, Twitter, Instagram or subscribe here to our newsletter.

Source: El País
