
Deploying NVIDIA Riva Multilingual ASR with Whisper and Canary Architectures While Selectively Deactivating NMT


NVIDIA has consistently developed automatic speech recognition (ASR) models that set the benchmark in the industry.

Earlier versions of NVIDIA Riva, a collection of GPU-accelerated speech and translation AI microservices for ASR, TTS, and NMT, support English-Spanish and English-Japanese code-switching ASR models based on the Conformer architecture. They also include a Parakeet-based model supporting multiple common languages in the EMEA region, namely British English, European Spanish, French, Italian, Standard German, and Armenian.

Recently, NVIDIA released the Riva 2.18.0 container and SDK to keep evolving its speech AI models. With this new release, we now offer the following:

  • Support for Parakeet, a streaming multilingual ASR model
  • Support for OpenAI’s Whisper-Large and Hugging Face’s Distil-Whisper-Large models for offline ASR and Any-to-English AST
  • The NVIDIA Canary models for offline ASR, Any-to-English, English-to-Any, and Any-to-Any AST
  • A new <dnt> SSML tag that tells a Megatron NMT model not to translate the enclosed text
  • A new DNT dictionary that tells a Megatron NMT model how to translate specified words or phrases

Automatic speech translation (AST) is the translation of speech in one language to text in another language without intermediate transcription in the first language.

NVIDIA also released NIM microservice implementations of Whisper and Canary (both 1B and 0.6B-Turbo) for optimized, modular, portable support of offline ASR and AST. NVIDIA Riva continues to add support for SOTA models and new architectures for both streaming and offline use cases, including AST models, speech-to-speech (S2S) capabilities, and multilingual models.

In the demos in this post, we focus on Whisper and Canary for offline ASR and AST, along with selectively deactivating and customizing Megatron NMT with <dnt> SSML tags and DNT dictionaries.

Riva multilingual offline ASR and AST with Whisper and Canary

Riva’s new support of Whisper for offline multilingual ASR enables you to transcribe audio recordings in dozens of languages. Whisper can also translate audio from any of the supported languages directly into English, rather than transcribing the audio in the source language and subsequently translating the transcription to English.

The config.sh script included in the NGC Riva Skills Quick Start resource folder provides everything that you need for launching a Riva server with Whisper capabilities. Ensure that the following variables are set as indicated: 

service_enabled_asr=true
asr_acoustic_model=("whisper") # or "distil_whisper" for lower memory requirements
asr_acoustic_model_variant=("large") # the default "" will probably also work
riva_model_loc="<path/to/model/files/outside/container>"

To launch a Riva server with Canary capabilities instead, set those variables as follows: 

service_enabled_asr=true
asr_acoustic_model=("canary") 
asr_acoustic_model_variant=("1b") # or "0.6_turbo" for faster inference
riva_model_loc="<path/to/model/files/outside/container>"

Run the riva_init.sh script provided in the same directory to download the models in RMIR form and deploy versions of those models optimized for your particular GPU architecture. Then run the riva_start.sh script to launch the Riva server.
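For example, from the Quick Start directory:

bash riva_init.sh
bash riva_start.sh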

NIM microservice versions of Whisper and Canary (both 1B and 0.6B-Turbo) are also available. To launch either the Whisper or Canary NIM microservice on your own system, choose the Docker tab of the model’s landing page and follow the instructions. In either case, you must generate an NGC API key and export it as an environment variable, NGC_API_KEY.
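For example, in a bash shell:

export NGC_API_KEY=<your NGC API key>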

Here’s the docker run command for the Whisper NIM microservice: 

docker run -it --rm --name=riva-asr \
   --runtime=nvidia \
   --gpus '"device=0"' \
   --shm-size=8GB \
   -e NGC_API_KEY \
   -e NIM_HTTP_API_PORT=9000 \
   -e NIM_GRPC_API_PORT=50051 \
   -p 9000:9000 \
   -p 50051:50051 \
   -e NIM_TAGS_SELECTOR=name=whisper-large-v3 \
   nvcr.io/nim/nvidia/riva-asr:1.3.0

To run the Canary NIM microservice instead, replace whisper-large-v3 with canary-1b or canary-0-6b-turbo in the docker run command. Irrespective of the ASR or AST model used, running a NIM microservice on your own system in this manner keeps the terminal occupied, as the container runs in the foreground. Use a different terminal, or a different interface entirely, to run inference with the Whisper or Canary NIM microservice. Otherwise, the process is identical to running inference with a Riva server set up with the classic Riva SDK.
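Before submitting inference calls, you can verify from another terminal that the microservice is up. Here’s a minimal sketch in Python, assuming the standard NIM readiness endpoint, /v1/health/ready, on the HTTP port mapped earlier:

import requests

# Probe the NIM readiness endpoint (port 9000 per the docker run command above)
response = requests.get('http://localhost:9000/v1/health/ready')
print(response.status_code, response.text)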

When the Riva server is launched, you can submit inference calls to it with C++ or Python APIs. We use Python examples for the rest of this post. 

Import the Riva Python client module and connect to the Riva server as follows: 

import riva.client
import riva.client.proto.riva_asr_pb2 as riva_asr
uri = 'localhost:50051'
auth = riva.client.Auth(uri=uri)

Next, define a function like the following to transcribe audio files with Whisper or Canary: 

def run_ast_inference(audio_file, model, auth=auth, source_language='multi', target_language=None, print_full_response=False):
    assert model in ['whisper', 'canary']

    # The 'multi' language code doesn't work with Canary, so change it
    if model == 'canary' and source_language == 'multi': 
        source_language = 'en-US'
    
    # Ensure that the ASR/AST model is available
    model_available = False
    client = riva.client.ASRService(auth)
    config_response = client.stub.GetRivaSpeechRecognitionConfig(riva_asr.RivaSpeechRecognitionConfigRequest())
    for model_config in config_response.model_config: 
        model_name = model_config.model_name
        if model in model_name and 'offline' in model_name: 
            model_available = True
            break
    assert model_available, f'Error: {model.capitalize()} ASR/AST is not available'
    
    # Read in the audio file 
    with open(audio_file, 'rb') as fh:
        data = fh.read()

    config = riva.client.RecognitionConfig(
        language_code=source_language,
        max_alternatives=1,
        enable_automatic_punctuation=True,
        model=model_name,
    )

    if target_language is not None:
        riva.client.add_custom_configuration_to_config(config, f'target_language:{target_language}')
        riva.client.add_custom_configuration_to_config(config, 'task:translate')

    response = client.offline_recognize(data, config)
    
    if print_full_response: 
        print(response)
    else:
        print(response.results[0].alternatives[0].transcript)

    # Return the response so callers can inspect it further
    return response

For the Riva 2.17.0 version of Whisper, you had to set the language_code parameter in the call to riva.client.RecognitionConfig to "en-US", irrespective of the language of the audio file being transcribed. 

Likewise, to tell Whisper to transcribe or translate from a particular language, you had to pass the source language in as a custom configuration parameter:

riva.client.add_custom_configuration_to_config(config, f'source_language:{source_language}')

For Riva 2.18.0 and later, setting language_code='multi' in the call to riva.client.RecognitionConfig enables Whisper to automatically detect the language of the input audio file. On the other hand, Canary does not support automatic language detection and won’t accept the 'multi' value for the language_code parameter.

In the following demo video, one of us plays recordings of himself reading Article 1 of the Universal Declaration of Human Rights in both English and Swedish. The subsequent instructions for Whisper and Canary ASR and AST refer to recordings used in that video.

Video 1. Riva Multilingual ASR With Whisper and Canary for Offline ASR Demo

Pass the English-language recording into the inference function with otherwise default arguments as follows:

response = run_ast_inference('udhr-english.wav', model='whisper')

This yields the following accurate transcription: 

All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.

In turn, pass the Swedish-language recording into the inference function with otherwise default arguments as follows:

response = run_ast_inference('udhr-swedish.wav', model='whisper')

This yields the following accurate transcription: 

Alla människor är födda fria och lika i värde och rättigheter. De är utrustade med förnuft och samvete och bör handla gentemot varandra i en anda av broderskap.

To tell Whisper to perform any-to-English AST, pass in the target_language parameter and, if desired, the source_language parameter. Each takes a language code consisting of two lowercase letters, optionally followed by a dash and a country code consisting of two uppercase letters. To obtain the two-letter code for a given country, use the pycountry Python module as follows:

import pycountry
pycountry.countries.search_fuzzy('<Country Name>')
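For instance, a quick lookup for Sweden (search_fuzzy returns a list of candidate matches, each with an alpha_2 attribute holding the two-letter code):

import pycountry

# The best match comes first; its alpha_2 attribute is the two-letter country code
match = pycountry.countries.search_fuzzy('Sweden')[0]
print(match.alpha_2)  # prints: SE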

For example, you can obtain an English transcription of the Swedish audio file as follows: 

response = run_ast_inference('udhr-swedish.wav', model='whisper', target_language='en-US')

This yields the following translation: 

All people are born free and equal in value and rights. They are equipped with reason and conscience and should act against each other in a spirit of brotherhood.

Ideally, this translated text would be identical to the English version of Article 1 of the Universal Declaration of Human Rights. For the most part, it’s close enough. However, while the Swedish preposition “gentemot” can mean “against,” in this context, it should be translated as “towards.” 
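If automatic language detection ever misidentifies the source, you can also pin the source language explicitly. Here’s a hedged variant of the same call, using the code format described earlier:

response = run_ast_inference('udhr-swedish.wav', model='whisper', source_language='sv-SE', target_language='en-US')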

As of this writing, Riva’s implementation of Whisper does not support streaming ASR or AST, English-to-Any AST, or Any-to-Any AST. 

Canary likewise supports both offline (but not streaming) ASR and AST. While it recognizes fewer languages than Whisper, it enables English-to-Any and Any-to-Any AST. 
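Because Canary does not auto-detect languages, a plain transcription call must name the source language explicitly. For example, reusing the English recording from the demo:

response = run_ast_inference('udhr-english.wav', model='canary', source_language='en-US')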

For example, consider a recording of the German version of Article 1 of the UDHR:

Alle Menschen sind frei und gleich an Würde und Rechten geboren. Sie sind mit Vernunft und Gewissen begabt und sollen einander im Geist der Brüderlichkeit begegnen.

Run Canary AST on that recording as follows:

response = run_ast_inference('udhr-german.wav', model='canary', source_language='de-DE', target_language='es-US')

This yields the following Spanish translation:

Todos los hombres nace libres e iguales en dignidad y derechos, dotados de razón y conciencia y deben enfrentarse en el espíritu de la fraternidad. 

For comparison, the official Spanish version of Article 1 of the UDHR is as follows:

Todos los seres humanos nacen libres e iguales en dignidad y derechos y, dotados como están de razón y conciencia, deben comportarse fraternalmente los unos con los otros.

<dnt> SSML tags and DNT dictionaries for selectively deactivating NMT and providing preferred translations

Riva 2.17.0 introduced <dnt> (that is, “do not translate”) SSML tags. Surrounding a word or phrase in a set of <dnt> tags tells Riva not to translate it. 

Video 2. Riva Do Not Translate SSML Demo

Riva 2.18.0 takes the DNT concept a step further, enabling you to upload entire dictionaries that specify preferred translations of words and phrases, or no translation at all. For the following examples of both uses, we draw on Swedish and German, as one of us happens to speak those languages in addition to English.

There are several reasons why you might not want a translation model to translate part of an input text: 

  • The text contains a proper name with a meaning in the source language but which is typically rendered untranslated in the target language.
  • The target language lacks a precise equivalent to a given word or phrase in the source language.

The Swedish adjective “lagom” is notoriously difficult to translate into English, but it means, approximately, “not too much, not too little, just right.” Oddly enough, dictionary.com lists “lagom” as a loanword into English. More curiously still, it describes “lagom” as a noun in English, whereas in Swedish it’s strictly an adjective.

In the config.sh script included in the Riva Skills Quick Start resource folder, set the following:

service_enabled_nmt=true

Then, in the models_nmt list, uncomment the entry for the Megatron any-to-any model:

"${riva_ngc_org}/${riva_ngc_team}/rmir_nmt_megatron_1b_any_any:${riva_ngc_model_version}"

Next, import the Riva Python client module in a Python script, interpreter, or notebook and connect to the Riva server, as shown earlier. You can then define a function like the following to run NMT inference:

def run_nmt_inference(texts, model, source_language, target_language, dnt_phrases_dict=None, auth=auth):
    # Translate a batch of texts, optionally applying a DNT dictionary
    # of preferred (or suppressed) translations
    client = riva.client.NeuralMachineTranslationClient(auth)
    resp = client.translate(texts, model, source_language, target_language, dnt_phrases_dict)
    return [translation.text for translation in resp.translations]

The following code example shows how to use <dnt> SSML tags to tell Riva NMT not to translate “lagom.” 

input_strings = [
    'Hur säger man <dnt>"lagom"</dnt> på engelska?'
]

model_name = 'megatronnmt_any_any_1b'
source_language = 'sv'
target_language = 'en'

translations = run_nmt_inference(input_strings, model_name, source_language, target_language)
for i, translation in enumerate(translations):
    print(f'\tTranslation {i}: {translation}')

This yields the following result:

Translation 0: How to say "lagom" in English?

Ideally, the translation should read, “How does one say ‘lagom’ in English?” or “How do you say ‘lagom’ in English?” 

You can achieve the same result with a dnt_phrases_dict dictionary: 

input_strings = [
    'Hur säger man "lagom" på engelska?'
]

dnt_phrases_dict = {"lagom": "lagom"}

model_name = 'megatronnmt_any_any_1b'
source_language = 'sv'
target_language = 'en'

translations = run_nmt_inference(input_strings, model_name, source_language, target_language, dnt_phrases_dict)
for i, translation in enumerate(translations):
    print(f'\tTranslation {i}: {translation}')

Again, this yields the same result:

Translation 0: How to say "lagom" in English?

For preferred translations, consider the Swedish noun “särskrivning” and the German equivalent “Getrenntschreibung.” English has no direct translation for these words. 

Most Germanic languages other than English (including Swedish and German) make extensive use of compound words, particularly in the case of noun adjuncts (nouns used as adjectives). In both Swedish and German, noun adjuncts and the nouns which they modify form compound words. There is a tendency in both languages (partly due to English influence, partly due to typographers who believe that ending a line with a hyphen is aesthetically unappealing) to separate words which, according to current grammatical rules, should be joined. 

“Särskrivning” and “Getrenntschreibung,” both of which literally mean “separate-writing” or “separate-spelling,” are the respective Swedish and German words for this tendency and examples thereof. 

You can ask Riva to translate the Swedish sentence, “Särskrivningar förstörde mitt liv” (roughly speaking, “Särskrivningar [that is, the plural of särskrivning] ruined my life”), to German. The following example uses a dictionary to indicate a preferred translation of “Särskrivningar” as “Getrenntschreibungen.”

input_strings = [
    'Särskrivningar förstörde mitt liv.'
]

dnt_phrases_dict = {"Särskrivningar": "Getrenntschreibungen"}

model_name = 'megatronnmt_any_any_1b'
source_language = 'sv'
target_language = 'de'

translations = run_nmt_inference(input_strings, model_name, source_language, target_language, dnt_phrases_dict)
for i, translation in enumerate(translations):
    print(f'\tTranslation {i}: {translation}')

This yields the following result:

Translation 0: Getrenntschreibungen hat mein Leben ruiniert.

The auxiliary verb form should be “haben” rather than “hat” in this example, as “särskrivningar” in the source text and “Getrenntschreibungen” in the translated text are plural nouns, but otherwise, this translation is sufficiently accurate.

As of Riva 2.18.0, the megatronnmt_any_any_1b model has 1.6B parameters and offers bidirectional translation support for 36 languages in total, four more than previous versions. For example, the model now treats European and Latin American Spanish as separate languages, as it does Simplified and Traditional Chinese.

As such, the model now requires that some language codes be expressed as two lowercase letters (the previous standard language code) followed by a dash and two uppercase letters (representing the country). 

Under this system, European and Latin American Spanish are coded as 'es-ES' and 'es-US', respectively, while Simplified and Traditional Chinese are coded as 'zh-CN' and 'zh-TW'. Languages that don’t require a combined language-and-country code still accept one. For example, you can tell Riva to use Swedish as a source or target language by passing either 'sv' or 'sv-SE' to the appropriate parameter.
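For instance, reusing the run_nmt_inference function defined earlier, you can target each Spanish variant separately (the input sentence is illustrative, not from the demo):

# European Spanish
run_nmt_inference(['The weather is nice today.'], 'megatronnmt_any_any_1b', 'en', 'es-ES')

# Latin American Spanish
run_nmt_inference(['The weather is nice today.'], 'megatronnmt_any_any_1b', 'en', 'es-US')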

Explore NGC’s Riva Skills Quick Start resource folder to launch a Riva server with NMT capabilities.
