Audio Seq2seq Model using Transformers
This article explains how the Seq2Seq architecture is applied to audio tasks and then walks through a practical example of fine-tuning an audio transformer.
What is a Seq2Seq model?
A Seq2Seq model pairs an encoder with a decoder, which allows the input and output sequences to have different lengths: the encoder processes the input while the decoder produces the output. These models are typically used for Automatic Speech Recognition (ASR), speech-to-speech translation, and speech synthesis. Let us see how input is processed through a Seq2Seq model.
- Encoder: The encoder takes a log-mel spectrogram as input and transforms it into a sequence of hidden states. These hidden states capture the essential features of the input speech and represent its overall meaning.
- Decoder: Subsequently, the encoder's output is inputted into the transformer decoder using a mechanism known as cross-attention. This mechanism, akin to self-attention but focused on the encoder output, allows the decoder to predict a sequence of text tokens in an autoregressive manner. Starting with an initial sequence containing only a "start" token (SOT in the case of Whisper), the decoder generates one token at a time. At each step, the previously generated sequence becomes the new input, progressively extending the output sequence. This process continues until the decoder predicts an "end" token or reaches a predefined maximum number of timesteps. In this architectural design, the decoder acts as a language model, leveraging the hidden-state representations from the encoder to produce corresponding text transcriptions.
For a more detailed understanding of how audio transformers work, refer to the article Audio Transformer.
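To make the encoder-decoder flow described above concrete, the sketch below runs Whisper's encoder once and then decodes greedily, one token at a time, starting from the start token. This is only an illustration: the silent dummy waveform and the 20-step limit are arbitrary choices for the example, and in practice you would simply call model.generate().
Python3
# Minimal sketch of autoregressive decoding with a Whisper-style seq2seq model.
# The silent 1-second clip is a placeholder input used only to show the flow.
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")

waveform = torch.zeros(16000)  # 1 second of silence at 16 kHz
input_features = processor(waveform.numpy(), sampling_rate=16000,
                           return_tensors="pt").input_features

# Encoder: log-mel spectrogram -> sequence of hidden states
encoder_outputs = model.get_encoder()(input_features)

# Decoder: start token, then predict one token at a time (greedy decoding)
decoder_ids = torch.tensor([[model.config.decoder_start_token_id]])
for _ in range(20):
    logits = model(encoder_outputs=encoder_outputs,
                   decoder_input_ids=decoder_ids).logits
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    decoder_ids = torch.cat([decoder_ids, next_token], dim=-1)
    if next_token.item() == model.config.eos_token_id:
        break

print(processor.batch_decode(decoder_ids, skip_special_tokens=True)[0])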
In this article, we will focus on implementing a seq2seq model using a transformer. We will use the pre-trained Whisper model from Hugging Face and fine-tune it.
Audio Seq2seq Model Implementation using Transformers
Install the Necessary Libraries
Install the libraries below if they are not already available in your environment. They are required to run the subsequent code.
- The 'Datasets' library is commonly used for working with machine learning datasets. The specific functions used from this library are load_dataset, Audio, and DatasetDict. These functions are used for loading datasets and dealing with audio data.
- 'Torch' is the PyTorch library, a popular deep-learning framework. This library provides tools for building and training neural networks
- 'Transformers' is a popular library for working with pre-trained models from Hugging Face.
- 'evaluate' contains functions related to evaluating models; 'jiwer' is required by its word error rate (WER) metric.
# Install the necessary libraries
!pip install datasets
!pip install transformers
!pip install torch
!pip install evaluate
!pip install jiwer
!pip install transformers[torch]
!pip install numpy
Step 1: Import the Necessary Libraries
Then import the libraries into your notebook:
Python3
##Imports required
import numpy as np
from datasets import load_dataset, Audio, DatasetDict
import torch
import evaluate
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union
from transformers import Seq2SeqTrainingArguments,Seq2SeqTrainer
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)
Step 2: Loading Dataset
About the PolyAI/minds14 dataset: MINDS-14 is a training and evaluation resource for intent detection tasks with spoken data. It covers 14 intents extracted from a commercial system in the e-banking domain, associated with spoken examples in 14 diverse language varieties.
- Using the load_dataset function of the datasets library, we load the en-US subset of the minds14 dataset from Hugging Face. We load only 80 samples for demo purposes.
- Data Fields
- path (str): Path to the audio file
- audio (dict): Audio object including loaded audio array, sampling rate, and path to audio
- transcription (str): Transcription of the audio file
- english_transcription (str): English transcription of the audio file
- intent_class (int): Class id of intent
- lang_id (int): Id of language
- We remove the unnecessary columns using the remove_columns method on the dataset
- We then split the dataset in an 80:20 ratio. We specify shuffle=False so that the split is reproducible across runs.
Python3
# Load the PolyAI dataset (first 80 examples of the en-US subset)
dataset = load_dataset("PolyAI/minds14", name="en-US", split="train[:80]")

# Remove unnecessary columns (remove_columns returns a new dataset)
dataset = dataset.remove_columns(
    ['path', 'english_transcription', 'intent_class', 'lang_id'])

# Split the dataset into train and test
dataset = dataset.train_test_split(test_size=0.2, shuffle=False)
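As a quick, optional sanity check, you can print one record to see the fields described above (the sampling rate of 8000 Hz is what MINDS-14 ships with before resampling).
Python3
# Inspect one training example after loading and splitting
sample = dataset["train"][0]
print(sample["transcription"])            # target text
print(sample["audio"]["sampling_rate"])   # 8000 Hz before resampling
print(sample["audio"]["array"].shape)     # raw waveform as a 1-D NumPy array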
Step 3: Data Pre-Processing and Tokenization
We first resample our audio data from 8 kHz to 16 kHz using the Audio feature of the datasets library, as the Whisper seq2seq model was trained on 16 kHz audio.
Python3
dataset['train'] = dataset['train'].cast_column("audio", Audio(sampling_rate=16000))
dataset['test'] = dataset['test'].cast_column("audio", Audio(sampling_rate=16000))
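You can confirm that the cast worked; the audio should now be decoded at 16 kHz when accessed.
Python3
# Optional check: the audio column is now decoded at 16 kHz
print(dataset["train"][0]["audio"]["sampling_rate"])  # 16000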
Let us import the Whisper model and processor from Hugging Face using the Transformers library:
- We need to create two columns named 'input_features' (the log-mel features extracted from the raw waveform) and 'labels' (the tokenized transcription).
- We use the WhisperProcessor from the Transformers library. It bundles a feature extractor and a tokenizer.
- The feature extractor is used to convert our raw audio data into the input form expected by the model.
- The tokenizer converts our target sentence into labels expected by the model
Python3
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-tiny.en", task="transcribe", model_max_length=225)
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-tiny.en")
model.to(device)


# Preparing a function to process the entire dataset
def prepare_dataset(batch):
    audio = batch["audio"]
    # Log-mel features extracted from the raw waveform
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    # Token ids of the target transcription
    batch["labels"] = processor.tokenizer(batch["transcription"]).input_ids
    return batch


np.object = object  # compatibility shim for code paths that still reference np.object
encoded_dataset = dataset.map(
    prepare_dataset, remove_columns=dataset.column_names["train"], num_proc=4)
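To see what prepare_dataset produced, you can inspect one processed example. Whisper's feature extractor pads or truncates every clip to 30 seconds, so each input_features entry should be an 80-bin log-mel spectrogram with 3000 frames (shapes assumed for the whisper-tiny checkpoint).
Python3
# Optional check on the processed dataset
example = encoded_dataset["train"][0]
print(len(example["input_features"]), len(example["input_features"][0]))  # 80 3000
print(example["labels"][:10])  # first few token ids of the transcription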
Step 4: Preparing data collator class
- The data collator class handles dynamic padding of our data at training time.
- The inputs (audio features) and the targets (text token ids) need different padding methods, so they are padded separately.
- We pad both and return them as PyTorch tensors; padded label positions are set to -100 so they are ignored by the loss.
Python3
# Creating a data collator class
from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    padding: Union[bool, str] = "longest"

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # Split inputs and labels since they have different lengths and need
        # different padding methods
        input_features = [{"input_features": feature["input_features"]}
                          for feature in features]
        label_features = [{"input_ids": feature["labels"]}
                          for feature in features]

        batch = self.processor.feature_extractor.pad(
            input_features, return_tensors="pt")
        labels_batch = self.processor.tokenizer.pad(
            label_features, return_tensors="pt")

        # Replace padding with -100 so it is ignored by the loss
        labels = labels_batch["input_ids"].masked_fill(
            labels_batch.attention_mask.ne(1), -100)
        batch["labels"] = labels
        return batch


data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
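You can try the collator on a couple of processed examples to see the padded tensors it returns; the shapes in the comments are what we expect for whisper-tiny features.
Python3
# Optional: run the data collator on two examples and inspect the batch
features = [encoded_dataset["train"][i] for i in range(2)]
batch = data_collator(features)
print(batch["input_features"].shape)  # e.g. torch.Size([2, 80, 3000])
print(batch["labels"].shape)          # (2, length of the longest label sequence)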
Step 5: Model Evaluation
We will evaluate our model using the word error rate (WER), which measures the difference between predicted and reference transcriptions. The compute_metrics function below works as follows:
- The predicted and label ids are extracted from the pred object, which the trainer passes in during evaluation.
- Values of -100 in the labels (inserted by the data collator) are replaced with the pad token id so the labels can be decoded.
- Both sequences are decoded back into text with the tokenizer's batch_decode method, skipping special tokens.
- The WER is computed with the evaluate metric and multiplied by 100 to express it as a percentage.
- The result is returned in a dictionary under the key "wer", the format the trainer expects for reporting.
Python3
# Evaluation metric: word error rate (WER)
import evaluate

metric = evaluate.load("wer")


def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # Replace -100 with the pad_token_id so the labels can be decoded
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

    # We do not want to group tokens when computing the metrics
    pred_str = processor.tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}
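As a quick illustration of what the metric measures, the toy example below compares two sentences that differ by a single word; with eight reference words, one substitution gives a WER of 12.5%.
Python3
# Toy WER example (not part of training): one substitution out of eight words
example_wer = 100 * metric.compute(
    predictions=["how much do I have in my account"],
    references=["how much do you have in my account"])
print(round(example_wer, 2))  # 12.5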
Step 6: Define our trainer
- training_args: This section initializes an object of the Seq2SeqTrainingArguments class, which contains various training-related settings. Some notable parameters include:
- output_dir: Directory where the trained model will be saved.
- gradient_checkpointing: Enables gradient checkpointing, a memory-saving technique during training.
- per_device_train_batch_size: Number of training samples processed in each forward/backward pass on each device.
- learning_rate: Initial learning rate for the optimizer.
- fp16: Enables mixed-precision training using a 16-bit floating-point format for improved training speed and reduced memory usage.
- optim: The optimizer used for training; in this case, it's set to 'adafactor'.
- predict_with_generate: Indicates that the model should use generation during prediction.
- evaluation_strategy: Defines when to evaluate the model during training, in this case, after a certain number of steps.
- per_device_eval_batch_size: Batch size for evaluation.
- eval_steps: Number of steps between evaluations.
- load_best_model_at_end: Indicates whether to load the best model checkpoint at the end of training.
- metric_for_best_model: The metric used to determine the best model checkpoint (we also set greater_is_better=False, since a lower WER is better).
- report_to: List of integrations to which training results should be reported, in this case, TensorBoard.
- trainer: This section initializes an object of the Seq2SeqTrainer class, which is responsible for handling the training process. It takes in the model, training arguments, datasets, tokenizers, data collator, and a function for computing metrics.
Python3
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

training_args = Seq2SeqTrainingArguments(
    output_dir="seqtoseq-trained",
    gradient_checkpointing=True,
    per_device_train_batch_size=2,
    learning_rate=1e-5,
    warmup_steps=2,
    max_steps=2000,
    fp16=True,  # set to False when training without a GPU
    optim='adafactor',
    predict_with_generate=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=2,
    eval_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,  # lower WER is better
    report_to=["tensorboard"],
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["test"],
    tokenizer=processor,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
To start training, run the command below:
Python3
# Requires GPU for training
trainer.train()
Output:
Logs of training:
Step Training Loss Validation Loss Wer
100 No log 0.525059 4.193548
200 No log 0.532363 1.846774
300 No log 0.553872 1.161290
400 No log 0.568876 1.161290
500 0.000000 0.590014 1.169355
Step 7: Drawing inferences
Let us check the output of our model after training
Python3
# getting test data
inputs = processor(dataset['test'][8]["audio"]["array"],
sampling_rate=16000, return_tensors="pt").to(device).input_features
print(f"The input test audio is: {dataset['test'][8]['transcription']}")
generated_ids = model.generate(inputs=inputs)
transcription = processor.batch_decode(
generated_ids, skip_special_tokens=True)[0]
print(f'The output prediction is : {transcription}')
Output:
The input test audio is: how much do I have in my account
The output prediction is : 'm much do I have in my account
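If you want to reuse the fine-tuned checkpoint later, you can save both the model and the processor; the directory name below is just an example.
Python3
# Optional: save the fine-tuned model and processor for later reuse
trainer.save_model("seqtoseq-trained-final")          # example output directory
processor.save_pretrained("seqtoseq-trained-final")

# Reload later with:
# model = WhisperForConditionalGeneration.from_pretrained("seqtoseq-trained-final")
# processor = WhisperProcessor.from_pretrained("seqtoseq-trained-final")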
Conclusion:
In this article, we saw how to fine-tune an audio seq2seq model using the transformers library.