


Nvidia Releases Massive AI-Ready Open European Language Dataset and Tools (siliconangle.com)
"Only a tiny fraction of the more than 7,000 languages on Earth are supported by artificial intelligence models," reported SiliconANGLE this week. So Nvidia announced "a massive new AI-ready dataset and models to support the development of high-quality AI translation for European languages."
The new dataset, named Granary, is a massive open-source corpus of multilingual audio, including more than a million hours of audio, plus 650,000 hours of speech recognition and 350,000 hours of speech translation. Nvidia's speech AI team collaborated with researchers from Carnegie Mellon University and Fondazione Bruno Kessler to process unlabeled audio and public speech data into information usable for AI training... Granary includes 25 European languages, representing nearly all of the European Union's 24 official languages, plus Russian and Ukrainian. The dataset also contains languages with limited available data, such as Croatian, Estonian and Maltese. This is critically important because providing these underrepresented human-annotated datasets will enable developers to create more inclusive speech technologies for audiences who speak those languages, while using less training data in their AI applications and models... The team demonstrated in their research paper that, compared to other popular datasets, it takes around half as much Granary training data to achieve high accuracy for automatic speech recognition and automatic speech translation.
Alongside Granary, Nvidia also released new Canary and Parakeet models to demonstrate what can be created with the dataset... The new Canary is available under a fairly permissive license for commercial and research use, expanding Canary's current languages from four to 25. It offers transcription and translation quality comparable to models three times larger while running inference up to 10 times faster. At 1 billion parameters, it can run completely on-device on most next-gen flagship smartphones for speech translation on the fly.
No, in Europe the approach is a tad different (Score:2)
You learn to communicate in several languages.
English as an intermediary? (Score:2)
I wonder if they're doing direct from say, French to German or Spanish to Polish, or using English as an intermediate representation. And, of course, there's always the idempotent test (translate from A to B, then send the result to B to A translation.)
dave
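For anyone who wants to try the round-trip check described above (A to B, then B back to A), here is a minimal Python sketch. Everything in it is an assumption made for illustration: `toy_translate` and its phrase table are hypothetical stand-ins for a real model, and `round_trip_score` and `pivot_translate` are made-up helper names, not part of Granary, Canary, or any Nvidia API. Plain string similarity is also only a crude proxy for translation quality.

```python
from difflib import SequenceMatcher
from typing import Callable

# Hypothetical translator signature: (text, source_lang, target_lang) -> text
Translator = Callable[[str, str, str], str]


def round_trip_score(text: str, src: str, tgt: str, translate: Translator) -> float:
    """Translate src -> tgt -> src and return the similarity (0..1) to the original."""
    forward = translate(text, src, tgt)
    back = translate(forward, tgt, src)
    return SequenceMatcher(None, text.lower(), back.lower()).ratio()


def pivot_translate(text: str, src: str, tgt: str,
                    translate: Translator, pivot: str = "en") -> str:
    """Translate through an intermediate language instead of directly."""
    if pivot in (src, tgt):
        # No detour needed when the pivot is already one end of the pair.
        return translate(text, src, tgt)
    return translate(translate(text, src, pivot), pivot, tgt)


# Toy stand-in for a real model: a tiny word-for-word phrase table (assumption).
TOY_TABLE = {
    ("fr", "de"): {"bonjour": "hallo"},
    ("de", "fr"): {"hallo": "bonjour"},
    ("fr", "en"): {"bonjour": "hello"},
    ("en", "de"): {"hello": "hallo"},
}


def toy_translate(text: str, src: str, tgt: str) -> str:
    """Look up each word in the phrase table; unknown words pass through unchanged."""
    table = TOY_TABLE.get((src, tgt), {})
    return " ".join(table.get(word, word) for word in text.split())


if __name__ == "__main__":
    print(round_trip_score("bonjour", "fr", "de", toy_translate))
    print(pivot_translate("bonjour", "fr", "de", toy_translate))
```

One caveat: a perfect round-trip score does not prove the intermediate translation is any good. A "translator" that returns its input unchanged would also score 1.0, so the check is best treated as a smoke test alongside a direct comparison against reference translations.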
Re:English as an intermediary? (Score:4, Informative)
Maybe the answer is in their paper: https://arxiv.org/html/2505.13... [arxiv.org]
Re: (Score:2)
I looked through the paper, and frankly I couldn't tell. But I'm not an AI/translation person, so I might have missed something.
But I do note their example is English-Croatian. So that just reinforces my question about translation between two arbitrary European languages.
Re: (Score:2)
English-speaking countries have a lot of lawyers, and the language has adopted many evasive and ambiguous words & idioms to accommodate their needs.
Translating from A to English and then from English back to A often results in linguistic mush.
Too stupid to release an FOSS graphics driver (Score:1)
NVIDIA lost its way years ago. This doesn't change anything.
May their business go down the tubes just like their GPUs on FOSS.
Re: (Score:2)
My understanding is that they signed away their right to do that when they did the deal to be the graphics chip for the original Xbox. They have released an OSS driver for Tegra, which was based on different tech.
As for their GPUs on FOSS, they are still the standard for GPGPU because of CUDA, because AMD doesn't seem able to solve their problems with ROCm support for whole lines of GPUs.
This may help unfuck the EU (Score:3)
This may actually help unfuck the EU as a structure in one of the fundamental ways it's fucked. Comprehension across languages.
Just the bureaucratic translation apparatus between all languages in Brussels is a money black hole on its own, and this has a good chance of removing it. Beyond that, the ability to actually communicate in the main European languages across the board would be a very welcome thing, as a lot of written and spoken assets are just not available in most European languages at all due to the fairly small pool of speakers.
It's not a solution in itself, but it's a very good first step in the direction of solving the Tower of Babel problem within the EU.
Re: (Score:2)
In practice, that's what they do.
Most scientific conferences are in English. International tradeshow exhibits are in English. Anyone working in a customer-facing tourist job is expected to speak English.
The same is true in Asia. There was a recent summit between Japan and Korea, who rarely speak each other's language. So the meetings were conducted in English, which is widely spoken in both countries.
Re: This may help unfuck the EU (Score:1)
Re: (Score:2)
It does seem that translation is just senseless busy work, with no real value add.
I mean, when it's all done, you've got 27 (or whatever) versions of a document that says the same thing (if translated correctly).
On the flip side, if everyone would just agree to pick one, ANY one language for all their official documents, all that extra work could be put into something that does add real value.
Re: (Score:2)
It's much worse. There's an army of interpreters doing live interpretation of meetings, sessions and so on. And it's an utterly insane endeavor, as proficiency for them is not "in language x" but in "interpreting from language x to language y".
This is why there's a fucking army of them, all sucking on the massive tit of the EU money. And it's still not enough for every meeting that needs to take place. This is why the de facto lingua franca of the EU is English. Even though the only one small nation where it's a natio
The EU bureaucrats will always catch you! (Score:1)
Do Not Want (Score:2)
"Only a tiny fraction of the more than 7,000 languages
on Earth are supported by artificial intelligence models."
I consider them lucky.