The Limited Times


Luciana Benotti, specialist in computational linguistics: “Data extraction for AI is a new colonization”

2024-01-25T05:40:13.423Z

Highlights: Computational linguist Luciana Benotti argues that data extraction for AI is a new form of colonization. Latin America is the least represented region in the world in the development of artificial intelligence. Most AI training data is in English or Chinese. She proposes reconsidering the moratorium on customs taxes on data leaving Spanish-speaking countries, which could motivate the creation of companies or institutions that store data locally and use it to build AI.


The Argentine researcher says that large technology companies extract information from the same places where colonial powers once looked for slaves.


Artificial intelligence (AI) and technology giants use personal data of everyone, including the billions of citizens of less developed nations, without paying royalties or providing any benefits.

This is how Argentine researcher Luciana Benotti (San Francisco, 44 years old) puts it: “Very few of these companies leave wealth in the Spanish-speaking community, but they feed on the data that we produce in Spanish for free.”

Benotti, who holds a doctorate in Computer Science with a specialty in Computational Linguistics and trained at the University of Comahue in Argentina, the Polytechnic of Madrid, the University of Bolzano in Italy and the Institut National de Recherche en Sciences et Technologies du Numérique (INRIA) in France, maintains that the use of data at no cost is a new form of extractivist colonization.

“Data is taken from the places slaves were once taken from and carried to the places slaves were once taken to,” she reflects.

Benotti, a researcher at the National Scientific and Technical Research Council (Conicet), is the first Latin American president of the Pan-American Computational Linguistics Association (all her predecessors were North American), which brings together 5,000 researchers and language model developers from universities and large technology companies such as Google or Meta.

In addition, she collaborates with the Vía Libre Foundation for digital rights, and is a member of the steering committee of Khipu, a community of AI researchers and developers in Latin America.

Benotti was the only Latin American academic to take part in the 2023 AI Safety Summit at Bletchley Park, United Kingdom.

“Latin America was the least represented region in the world.

Spanish was not even on the list of seven languages in which simultaneous translation was available,” she says.

Her research team is, in fact, developing a tool to detect social biases in Spanish-language models.

Question.

What is the participation of the Spanish-speaking world in the development of artificial intelligence tools?

Answer.

According to 2020 data from the Inter-American Development Bank, Latin America and the Caribbean is the worst-represented region in both patents and scientific articles that show participation in the development of artificial intelligence.

The large conference to be held in Singapore on natural language processing (NLP) and language models for generative artificial intelligence has more than 3,000 registered participants, 23 of whom are from Spain and 13 from Spanish-speaking Latin America.

As for the Spanish language, an analysis of scientific articles from the last 10 years, broken down by the language they study, shows that Spanish is the eighth most studied language in the computational linguistics community, but far behind English, Chinese and German, which together account for more than 70% of the work.

Q.

Does the AI speak English?

A.

We can say that, even when the AI speaks in Spanish, it thinks in English or Chinese because most of its training data is in those languages.

By that I mean that the positionality of the AI is mostly that of a person born in countries where English or Chinese is spoken.

A person's positionality refers to the perspectives they hold as a result of their demographic characteristics, identity, and life experiences.

Recent work has found that most publicly available AI models align predominantly with white, college-educated, native English speakers from the Northern Hemisphere.

Q.

Is this a new form of colonization?

A.

Yes, it is a new form of extractive colonization.

Data is taken from the places slaves were once taken from and carried to the places slaves were once taken to.

Oil extraction, mining and intensive agriculture leave royalties behind; data extractivism does not, yet it does use the time of the people who generate the data.

Q.

In what way can the Spanish-speaking world play a leading role in this field?

A.

Companies are not selling AI but renting it, storing the data of their clients (companies and governments) on their own computers, which they call the cloud.

This data generally becomes their property and can be used to train new AI models.

One way to start protecting our data, or charging for it, would be to reconsider the moratorium on customs taxes on data leaving Spanish-speaking countries, so that big tech has to pay not only for the hardware but also for data, its raw material.

This could motivate the creation of companies or institutions that store data in Spanish-speaking territories and use it to build AI.

Right now it is impossible to compete with big tech.

Q.

So, it is very important that you position yourself…

A.

AI is already having an impact on the labor market and will have a greater one in the future.

The impact of these technologies on employment is an unavoidable issue.

An improvement in productivity should have a direct correlation with an improvement in working conditions and the quality of employment, with special attention to the most vulnerable populations.

But this is unlikely to happen if the AI being used is imported.

Any transformation of the labor market must prioritize the problem of unemployment and precariousness with proactive and effective measures.

This is particularly important for our communities.

Q.

Is the cultural vision of Latin America excluded from AI?

A.

Companies like OpenAI, Meta, Google and others surely have access to data in Spanish, but we don't know how much or which data.

We can only suspect that they may also be using the personal data that passes through the Las Toninas submarine cables every day, whenever we use WhatsApp, Google applications and the like.

With this data it is possible to develop language models such as ChatGPT.

These models have evolved into useful technologies with capabilities that did not exist a short time ago.

However, human behavior is inherently shaped by cultural contexts, some of which will be reflected in the data used to train NLP [natural language processing] models, but not completely.

Q.

So the models that seem to represent us really don't?

A.

An important thing to note is that current language models, such as ChatGPT, are multilingual and include a majority of data in English or Chinese.

So their positionality, even if they speak Spanish, is generally that of someone from an English-speaking culture in the case of ChatGPT, or Chinese if we talk about Baidu's Ernie Bot.

Q.

Could we say then that there is no diversity in this area?

A.

That's right, there is no diversity.

Although these events [such as the 2023 AI Safety Summit at Bletchley Park] always insist that diversity is important, in reality there are no concrete measures.

Investment to improve diversity is non-existent compared to investment to strengthen the position in this area of countries that already have a monopoly.

There were more than 100 participants at the summit and it was intended to be a global meeting.

However, the representation of Spanish speakers was very limited.

Q.

What economic impact does this low Latin American participation in AI have?

A.

The AI companies known as big tech are the richest right now.

AI is one of the most important sources of economic wealth in the world.

Six of the eight most valuable companies on the planet rely heavily on AI.

Very few of these companies leave wealth in the Spanish-speaking community, but they feed on the data we produce in Spanish, for free.

At the summit I mentioned the need for representation from the global South, but it had very little echo among the participants.

While everyone talked about the importance of diversity in the abstract, the only person I heard talk about the lack of representation of the global South in AI was China's Minister of Science.

Source: elparis
