The Limited Times

Wikipedia turns 20: "If Google were to tackle the project, it would fail gloriously"

2021-01-17T18:34:44.528Z


Wikipedia contains 50 million articles. But it could get much bigger and better, says one of its employees. Computer scientist Denny Vrandečić is working on a universal language of facts.


Wikipedia: "There will probably be a lot of resistance and discussion"

Photo: Sebastian Gollnow / dpa

Wikipedia is 20 years old.

The online lexicon now includes more than 50 million articles.

If the different language versions were translated into one another, it could be much bigger.

The German Wikipedia, for example, contains around 2.5 million articles, but not even half of them appear in the English version, says the computer scientist and philosopher Zdenko "Denny" Vrandečić.

With »Abstract Wikipedia« he is developing a kind of universal language for automatic translation between all 300 Wikipedia language versions in the world.

He spoke about this with SPIEGEL in a Zoom interview from his home office in Berkeley, California.

SPIEGEL:

20 years after it was founded, Wikipedia is one of the most popular websites in the world.

Why is an »Abstract Wikipedia« needed?

Vrandečić:

I once compared different Wikipedias on a single question: who is the mayor of San Francisco?

The result was a complete muddle. Only a few versions listed the current mayor, although most of them at least named someone who had once held the office.

Much of it wasn't wrong, just completely out of date.

Not for political reasons, but simply because updating is work.

Our translation project could help here.

Ideally, we would not resolve the contradictions automatically but make them visible, so that the community could work through them more easily.
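To illustrate what making contradictions visible could mean in practice, here is a minimal Python sketch. It is not the project's actual tooling: the function names `group_by_claim` and `report`, the language codes, and the placeholder values "Person A" to "Person C" are all assumptions made up for this example. The sketch only groups language editions by the value each one states and reports the disagreement without changing anything.

```python
from collections import defaultdict


def group_by_claim(claims):
    """Group language codes by the value each edition states for one fact."""
    groups = defaultdict(list)
    for lang, value in claims.items():
        groups[value].append(lang)
    return dict(groups)


def report(fact, claims):
    """Describe disagreements between editions without resolving them."""
    groups = group_by_claim(claims)
    if len(groups) <= 1:
        return f"{fact}: all editions agree"
    lines = [f"{fact}: {len(groups)} differing values"]
    for value, langs in sorted(groups.items()):
        lines.append(f"  {value}: {', '.join(sorted(langs))}")
    return "\n".join(lines)


# Hypothetical placeholder data, not real Wikipedia content.
claims = {"en": "Person A", "de": "Person A", "fr": "Person B", "hr": "Person C"}
print(report("mayor of San Francisco", claims))
```

The report leaves the decision about which value is current to human editors, which matches the idea of surfacing conflicts rather than overwriting them.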

SPIEGEL:

When will the first translation results be available?

Vrandečić:

We just have to try it out.

We'll know more soon, maybe around 2022 or maybe 2023.

SPIEGEL:

How would such a translation work?

Vrandečić:

We want to formulate Wikipedia entries in such a way that they are independent of a specific natural language.

And from this abstract representation, which could look a bit like a programming language, we want to generate entries in English and German and other natural languages.
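A rough Python sketch of that idea, under stated assumptions: the `CapitalOf` record, the `LABELS` table, and the `render_en` / `render_de` functions are invented for illustration and do not reflect the project's real notation. One language-independent fact is stored once, and a small hand-written function per language turns it into a sentence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapitalOf:
    """Language-independent fact: `city` is the capital of `country`."""
    city: str      # abstract identifier, not a display name
    country: str   # abstract identifier, not a display name


# Hypothetical per-language labels for the abstract identifiers.
LABELS = {
    "en": {"berlin": "Berlin", "germany": "Germany"},
    "de": {"berlin": "Berlin", "germany": "Deutschland"},
}


def render_en(fact, labels):
    return f"{labels[fact.city]} is the capital of {labels[fact.country]}."


def render_de(fact, labels):
    return f"{labels[fact.city]} ist die Hauptstadt von {labels[fact.country]}."


# One renderer per target language; adding a language means adding a function.
RENDERERS = {"en": render_en, "de": render_de}


def render(fact, lang):
    """Generate a sentence for `fact` in the requested language."""
    return RENDERERS[lang](fact, LABELS[lang])


fact = CapitalOf(city="berlin", country="germany")
for lang in ("en", "de"):
    print(lang, render(fact, lang))
```

Because each renderer is an ordinary, readable function, contributors who speak the language can inspect and correct it directly. Real grammar (case, gender, word order) would need far more than a single template per language.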

SPIEGEL:

How can I imagine that in concrete terms?

Vrandečić:

You can imagine our translation project a little like the difference between a formula and a sentence.

For example, take a math expression like "50%".

This abstract term »50%« could easily be expressed in different languages by saying »every second one«, or »half«, or in German »die Hälfte«, or in French »la moitié«.

The abstract content would always be the same, although the target languages are different.

These entries would then be available to the local Wikipedias to enrich their content.
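The »50%« example can be sketched in a few lines of Python. This is only an illustration: the `PHRASES` table and the `render_share` function are hypothetical names, and the fallback to a plain percentage is an assumption of this sketch, not something described in the interview.

```python
from fractions import Fraction

# Hypothetical phrase table: one abstract value, several surface forms.
PHRASES = {
    "en": {Fraction(1, 2): "half"},
    "de": {Fraction(1, 2): "die Hälfte"},
    "fr": {Fraction(1, 2): "la moitié"},
}


def render_share(value, lang):
    """Express an abstract share in `lang`, falling back to a percentage."""
    phrase = PHRASES.get(lang, {}).get(value)
    if phrase is not None:
        return phrase
    # No idiomatic phrase known: emit the language-neutral percentage.
    return f"{float(value * 100):g}%"


share = Fraction(50, 100)  # the abstract content; normalizes to 1/2
for lang in ("en", "de", "fr"):
    print(lang, render_share(share, lang))
```

The stored value never changes. Only the last step, choosing words for it, depends on the target language.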

SPIEGEL:

Won't there also be resistance from tens of thousands of volunteers to such a project for automatic text generation?

They could see their work and competence threatened by an opaque AI juggernaut that spits out texts without their intervention.

Vrandečić:

Yes, there will probably be a lot of resistance and discussion.

But that's a good thing.

We have to discuss with the community where the new technologies can best be used.

SPIEGEL:

The Wikipedian Heather Ford from the University of Leeds wrote to me: "If we don't have the abstract Wikipedia developed by representatives of the smaller language groups themselves, it will tend to exacerbate the problems of inequality."

Vrandečić:

I agree, and that's why it is so important for the project that the content of the Abstract Wikipedia is actually contributed by the global community.

That contributions about Amhara culture come from Amharas, and that Bengalis write about Bengali dances.

Everyone must have the opportunity to contribute to the content of the Abstract Wikipedia.

Traditional Indian dance in the Indian state of Kerala (2019)

Photo: DIBYANGSHU SARKAR / AFP

SPIEGEL:

The biologist and Wikipedian Ian Ramjohn from the University of Michigan also warns: “Algorithms tend to reproduce the unconscious prejudices of the people who program them.

Artificial intelligence is not immune to it either.

But now the worldview of men from industrialized nations is already very strongly represented in Wikipedia, and it is precisely these people who in turn would program the translation software.”

Vrandečić:

That is why we work with a rule-based and function-based system in which the contributors retain full control over the content and its presentation and do not depend on the prejudices that language models learn and absorb.

SPIEGEL:

Your doctoral supervisor Rudi Studer from KIT writes to me: “The Abstract Wikipedia is a very ambitious project with diverse challenges.

For example, it is difficult to find abstract structures that are general enough to capture the different linguistic aspects of the many different languages.”

Vrandečić:

It's true, every time I presented my project to researchers, the immediate reaction came: It's completely impossible.

But the longer we talk, the more they say: Yes, the individual steps are understandable.

I don't see why it shouldn't work. 

SPIEGEL:

How much basic research is still missing?

Vrandečić:

From the outside, our translation project sounds incredibly ambitious.

But when you get into it, you notice: we're not tinkering with science-fiction technology, we're working with the same ordinary ingredients as everyone else.

Most of it is well researched; we are simply re-applying software that has been around for a long time.

But our project is still highly risky.

We have no idea which parts of it will work at all.

Mr. Spock, played by a double in London (2015)

Photo: Neil Hall / REUTERS

SPIEGEL:

Would Abstract Wikipedia have advantages beyond exchanging content between language versions?

Vrandečić:

Yes.

Studies have shown that Wikipedia is far too demanding; the reading level is too high.

That's why there is also a simpler version, the Simple English Wikipedia.

But even that is too complicated for many people, especially if English is not their mother tongue.

I was once out in the park with my daughter and looked up the entry on daisies.

I only understood half of it because I don't have a doctorate in biology.

A simpler, clearer Wikipedia would be desirable.

Our translation project might help with that.

Daisies in an allotment garden in Frankfurt (2015)

Photo: Patrick Pleul / dpa

SPIEGEL:

Why don't you just use DeepL or Google Translate for the translation?

Vrandečić:

Most machine learning projects simply rely on huge amounts of text as input, from which they can then quickly and cheaply produce huge amounts of text as output.

This is inexpensive, but also prone to errors.

It can easily lead to bizarre mistakes, so-called "hallucinations": neural networks sometimes spit out completely absurd texts, simply because text passages are matched up incorrectly.

We wouldn't have this problem, because our system is based on an abstract encoding of facts that relies on precision rather than on masses of text.

Our approach is much more demanding: we rely on an abstract language in the background.

Jimmy Wales, co-founder of Wikipedia (2011)

Photo: Fabian Bimmer / dpa

SPIEGEL:

Is this approach completely new?

Vrandečić:

No.

The principle is called »Rule Based Natural Language Generation«.

One well-known project was at the copier manufacturer Xerox, which generated the operating instructions for its devices with such a rule-based system.

There are already frameworks that are pretty good.

The only place where we are breaking new ground: we are applying this to languages for which such systems do not yet exist.

We want to expand it to 300 languages.

SPIEGEL:

You worked at Google until last year. Why didn't you run your project there?

Vrandečić:

Working at Google was great.

As an employee, you have access to so many clever people.

That helped me a lot.

For a while I ran the project in the 20 percent of working time that every employee can use for their own projects.

And then a year ago I switched to Google Research to concentrate fully on it.

But then I decided that the project should be located at Wikimedia.

If Google tackled the project, it would fail gloriously.

With Google Maps, crowdsourcing did work really well, and a lot is happening there.

But the project is in much better hands with the Wikimedia Foundation.

Google Campus (2019)

Photo: Amy Osborne / AFP

SPIEGEL:

You published your proposal for an Abstract Wikipedia on April 1, 2020, of all days.

An April Fools' joke?

Vrandečić:

This is one of my favorite dates for new projects.

We also proposed Wikidata on the first of April.

This is also a tradition at Google; Gmail, for example, was launched on the first of April.

On the first of April you can dare to do more, and the reactions may be more interesting.

SPIEGEL:

How did you come to Wikipedia?

Vrandečić:

As a child, I played role-playing games.

I read Das Schwarze Auge (DSA) endlessly.

This is a classic role-playing game that needs nothing more than pen and paper.

I soon started writing my own DSA stories and learned that writing books is not some kind of magic; behind them are normal people like you and me.

Everyone can write.

Readers can also influence how a story develops, simply by taking notes.

It's similar with the new Abstract Wikipedia project.

SPIEGEL:

What were your own first entries on Wikipedia?

Vrandečić:

I co-founded the Croatian Wikipedia.

My parents come from the island of Brač in Croatia.

One of my early edits was probably about Brač, when I was in my mid-twenties.

My Croatian is not great.

In addition, the language has changed a lot in the last 20 years. There are political reasons for this: many terms are being redefined today in order to differentiate Croatian from Serbian.

The port city of Bol on the Croatian island of Brač (2007)

Photo: Sheila Norman-Culp / AP

SPIEGEL:

Which languages do you want to translate first using the Abstract Wikipedia?

Vrandečić:

I want to cover as many different language families as possible.

For example, I would like to have Arabic on board, or Hebrew or another Semitic language.

Chinese would be nice too, but it's difficult because of the political situation.

But an African language would be good.

Why not Amharic?

Or one of the more than 500 languages in Nigeria, such as Hausa or Yoruba.

I could even imagine us including a sign language.

Source: spiegel
