
Fewer hamburgers and more paellas: this will be the Spanish ChatGPT announced by Pedro Sánchez

2024-03-01T05:14:36.408Z

Highlights: Spanish Government to create AI model trained in Spanish and co-official languages. It will be a basic AI model for companies and administrations to train for their specific needs. Without enormous computing power it is impossible to teach an AI to write. In Spain only a handful of groups are capable of programming a model of this type. It is feasible to think that before the end of 2024 we will have a GPT-3 model, says Pep Martorell, deputy director of the Barcelona Supercomputing Center (BSC)


The Government promotes an artificial intelligence model, trained in the official languages, that will allow the country's organizations to create their own 'chatbots'


ChatGPT, Gemini, Copilot and other AI-based apps created by large companies work well in Spanish.

This Sunday, however, President Pedro Sánchez announced in Barcelona that his government was going to collaborate in building “a great foundational artificial-intelligence language model, trained in Spanish and the co-official languages, in open and transparent code.”

What new features and benefits does this initiative provide?

According to Government sources, the project is for now only “in the announcement phase”: details of its staffing and financing will be known “soon,” with no date specified.

EL PAÍS has asked the organizations named as collaborators in Sánchez's announcement about the details of the project, as well as experts who have participated in similar projects in Spain.

These are some of the characteristics that this foundational model of artificial intelligence (AI) made in Spain will have.

1. It will not be a general 'chatbot'

A “foundational” model does not mean a general chatbot like ChatGPT, which requires long and expensive work with humans providing thousands of instructions.

So you will not be able to ask it just anything, nor will there be a page where the public can access it.

It will be a basic AI model for companies and administrations to train for their specific needs.

“This is the fundamental problem,” says Pep Martorell, deputy director of the Barcelona Supercomputing Center (BSC), an institution that Sánchez designated as one of those responsible for the project.

“If the administration wants to create a chatbot for primary care, for example, how would it do it? On top of OpenAI? That has many problems: licenses, bias, closed data, language,” adds Martorell.

The foundational model is the foundation on which each organization will build its AI “home.”

It is easier for the creator of these foundations to be a nearby public organization, more bound to transparency, than a Silicon Valley company: “A company will hardly use ChatGPT for certain tasks, because it goes off the rails,” says Marta Villegas, leader of Language Technologies at the BSC, referring to the flagrant errors in its answers.

“There are situations where you do not need that much, and there is a lot of demand for models that can be adapted to a specific business and retrained to answer questions about a car brand or a public service (how to pay the IBI, for example),” she adds.
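The division of labor described here, a shared base model that each organization then continues training on its own texts, can be illustrated with a deliberately tiny sketch: a toy bigram counter rather than a real neural model, with training strings invented for this example.

```python
import copy
from collections import Counter, defaultdict

def train_bigrams(text, counts=None):
    """Count word bigrams; pass existing counts to continue training on new text."""
    counts = counts if counts is not None else defaultdict(Counter)
    words = text.lower().split()
    for a, b in zip(words, words[1:]):
        counts[a][b] += 1
    return counts

def predict(counts, word):
    """Most frequent continuation seen for `word`, or None if the word is unseen."""
    if word not in counts or not counts[word]:
        return None
    return counts[word].most_common(1)[0][0]

# "Foundational" stage: a base trained on broad, general text.
base = train_bigrams("the model reads the web and the model writes text")

# Adaptation stage: an organization continues training a copy of the base
# on its own domain corpus (an invented primary-care / IBI snippet).
adapted = train_bigrams(
    "the patient pays the IBI online and the patient books a visit",
    counts=copy.deepcopy(base),
)

print(predict(base, "patient"))     # None: the base never saw the domain
print(predict(adapted, "patient"))  # "pays": adaptation added that knowledge
```

The real project would fine-tune a large transformer on MareNostrum 5, but the principle is the same: the base parameters (here, mere bigram counts) carry general knowledge, and continued training on domain text adds what the base never saw.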

2. It won't be easy to do

The BSC and its recently inaugurated MareNostrum 5 supercomputer are a key piece in creating this model.

Without enormous computing power it is impossible to teach an AI to write.

In a country like Spain, without public support from various administrations it would not even be possible to attempt something like this: “It is something we already see in several European countries: the public sector promotes the generation of models using the resources of the large research centers,” says Martorell.

There is also a second problem: in Spain only a handful of groups are capable of programming a model of this type.

They are all in research centers or universities: “We are a handful of people capable of doing this,” says Germán Rigau, deputy director of HiTZ (Basque Research Center for Language Technologies), pioneers in Spain.

“Within AI it is something that not everyone knows how to do.

Only some centers do it and evaluate it,” he explains.

HiTZ has just presented the largest language model made in Spain, which is in Basque and based on Meta's open-source Llama.

Furthermore, it is difficult to retain talent: “We motivate young people by telling them that this is a reference center, but many still leave for Google, Amazon or Cohere, or start their own companies,” he adds.

All this does not necessarily imply that this joint effort will produce a next-generation model.

It is more likely to be from a previous generation: “It is feasible to think that before the end of 2024 we will have a GPT-3-level model in Spanish and the rest of the co-official languages available to companies,” says Martorell.

And when will there be a GPT-4-level model, the current standard behind ChatGPT?

“As soon as the data we collect and the capacity of MareNostrum 5 allow us,” he adds.

3. A lot of baseball, little soccer

Models like ChatGPT are already multilingual: it makes little sense not to add more languages when training them, since the models learn them and use them to translate.

But a language is not just its words, it is also the context and culture.

There are a lot of variables: tradition, leisure, cuisine, sports.

This whole context is not only culture, but also the meaning of proverbs or idioms that only make sense in one language, which are untranslatable.

With Spanish, a language widely represented on the Internet, it is relatively easy to achieve good quality.

Even with Catalan.

But the millions of texts (called a “corpus”) used to train models in Galician or Basque are much smaller, explains Rigau: “In Basque we have 4,000 million tokens [the small blocks of text that machines use to process language]. Catalan will have about 20,000 million, five times more. Spanish will have 250,000 million, more than ten times Catalan. That is all we have been able to gather. No matter how much we scrape, this is the scale.”
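Taking the quoted figures at face value (sizes in millions of tokens), the gaps are simple arithmetic:

```python
# Corpus sizes in millions of tokens, as quoted in the article.
corpus_m_tokens = {"Basque": 4_000, "Catalan": 20_000, "Spanish": 250_000}

print(corpus_m_tokens["Catalan"] / corpus_m_tokens["Basque"])   # 5.0
print(corpus_m_tokens["Spanish"] / corpus_m_tokens["Catalan"])  # 12.5
```

So Catalan has five times the Basque corpus, and Spanish more than twelve times the Catalan one.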

One objective of this foundational model is to build a better corpus in the four co-official languages.

Each institution is trying to close agreements with organizations that have produced texts in their languages, from regional parliaments to television stations: “For our languages we have done a more curated search for content and made an effort to gather non-conflicting data: Wikipedia, of course, but also data from regional parliaments and from TV3; Dialnet and the CSIC have let us collect open-access journals, and the Elcano Foundation also gave us a great deal of data. For Catalan, for example, we have data from Òmnium and Vilaweb; each group makes an effort within its language to obtain curated data,” says Villegas.

Similar work has been done with Galician and Basque.

All this effort is intended not only for the model to respond in more correct Galician, but also for it to know better what it is talking about when it refers to local matters: “A model from a large company will know a lot about the Super Bowl and will be very Anglocentric,” says Villegas.

“It is not only a question of language, but also of implicit knowledge, of the model of the world,” she adds.

The Spanish model should have less baseball and hamburgers and more soccer and paellas.

When you have a larger corpus in a language, you have more information about the complex world described in that language.

Therefore, when it comes to analyzing clinical records written by doctors, or legal rulings, it is essential that models be trained on and attuned to the local language and content; otherwise they would lose too many nuances.

4. It is a strategic bet

Alongside the linguistic and cultural needs of a country like Spain, there is also an attempt to make a strategic technological bet.

“It is not just a sentimental, historical or cultural question,” says Senén Barro, professor at the University of Santiago de Compostela and director of its Singular Research Center in Intelligent Technologies.

“It's strategic.

If we are able to create a powerful industry of language-technology companies in Spain, they will be able to work not only for domestic consumption but for the world, for example in multilingual countries like this one.

It is a brutal market.

It is estimated that by the end of the decade the economy around language technologies may be worth around 100 billion.

It is a huge amount,” he adds.

It would be strange if much of the medical or legal data that Spanish administrations and companies need to work with ended up in the hands of American or Chinese technology firms.

“It must also be about sovereignty; it is about building up the industry,” says Rigau.

“Will we always depend on outsiders? There is a lot of sensitive data.”

5. The copyright problem persists

The first difficulty in training such a model is obtaining billions of texts.

The most obvious place is the web.

The Common Crawl organization periodically crawls and archives the open web.

Its objective is laudable, namely that this material be accessible to everyone, not just large technology companies: “Small companies or even individuals can access high-quality crawl data that was previously only available to large corporations,” they say on their page.

The data for this model made from Spain will also come from there.

In the Common Crawl archives there is the entire web: graphics, pornography, absurd memes and, in all likelihood, copyrighted material.

Those in charge of the model clean out biased, toxic or lewd material when training it, but copyrighted material is more delicate: “Guaranteeing that there are no copyright problems is complicated. We take downloads from Common Crawl, which in the US is allowed under the protection of fair use,” says Villegas.

This “fair use” doctrine allows copyrighted material to be used in certain cases, such as education, quotation in news reporting, or academia.

Its use to train AI models is still in legal dispute.

“These models do not make copies,” explains Rigau.

“It is something very complex; it is as if a person read a lot, say 20 million books. What would you remember of them? This is the same. It reads, it does not copy. The machine's memory is not that good either: it invents things, it imagines them. If you give it the beginning of Don Quixote, it will not know how to continue. It will know things, it will remember songs like anyone else. It memorizes something, but it does not reproduce any complete work,” he says.



Source: EL PAÍS
