Masked Autoencoders in Deep Learning

Last Updated : 08 Jul, 2024

Masked autoencoders are neural network models designed to reconstruct input data from partially masked or corrupted versions, helping the model learn robust feature representations. They are significant in deep learning for tasks such as data denoising, anomaly detection, and improving model generalization by training on incomplete data.

In this article, we will explore the architecture and working of masked autoencoders.

Autoencoders are a class of artificial neural networks used to learn efficient representations of data, typically for the purpose of dimensionality reduction or feature learning. They consist of an encoder, which compresses the input into a latent-space representation, and a decoder, which reconstructs the input from this representation. Over the years, autoencoders have evolved significantly, leading to various advancements in deep learning.

Understanding Masked Autoencoders

Masked Autoencoders (MAEs) represent a recent advancement in the autoencoder architecture, primarily aimed at improving the efficiency and effectiveness of models in learning representations from high-dimensional data. MAEs have gained significant attention, especially in natural language processing (NLP) and computer vision, for their ability to model complex dependencies within data.

Masked Autoencoders introduce a novel approach by randomly masking portions of the input data and training the model to reconstruct the missing parts. This encourages the model to learn robust and meaningful representations that capture the underlying structure of the data.

Architecture of Masked Autoencoders

1. Masking Mechanism

The input data is partially masked or corrupted before being fed into the autoencoder. This masking can be implemented by randomly setting a portion of the input elements to zero or some constant value. The masking process creates an incomplete version of the original input that the autoencoder will learn to reconstruct.
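
To make this concrete, here is a minimal NumPy sketch of zero-masking; the flat toy vector and the ~50% keep rate are illustrative choices, not part of any particular model:

import numpy as np

rng = np.random.default_rng(0)
x = rng.random(8)                  # toy input vector
keep = rng.random(8) > 0.5         # keep each element with ~50% probability
x_masked = np.where(keep, x, 0.0)  # masked elements are set to zero

print(x)
print(x_masked)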

2. Encoder

The encoder is designed to process the masked input and produce a compact, latent representation. This typically involves several layers, such as:

  • Convolutional Layers (for image data): These layers apply convolution operations to capture spatial hierarchies and features in the data.
  • Fully Connected Layers: These layers are used to compress the data into a lower-dimensional space, retaining essential features.
  • Activation Functions: Common activation functions like ReLU (Rectified Linear Unit) are used to introduce non-linearity and help the network learn complex patterns.

3. Latent Space

The latent space is the compressed representation of the input data. It captures the most important features and patterns while discarding redundant information. The size of the latent space is a critical hyperparameter that determines the balance between compression and the ability to reconstruct the input accurately.
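
As a rough illustration of this trade-off, the sketch below defines a fully connected autoencoder whose bottleneck width is a single hyperparameter; the name latent_dim and the default sizes are illustrative assumptions, not part of the masked autoencoder built later in this article:

from tensorflow.keras import layers, models

def build_dense_autoencoder(input_dim=784, latent_dim=32):
    inputs = layers.Input(shape=(input_dim,))
    # Bottleneck: a smaller latent_dim compresses more aggressively,
    # a larger latent_dim makes accurate reconstruction easier
    z = layers.Dense(latent_dim, activation='relu')(inputs)
    outputs = layers.Dense(input_dim, activation='sigmoid')(z)
    return models.Model(inputs, outputs)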

4. Decoder

The decoder is responsible for reconstructing the original input from the latent representation. It typically mirrors the encoder's architecture but in reverse, involving:

  • Deconvolutional (Transposed Convolutional) Layers: These layers expand the latent representation back to the original input size by reversing the convolution operations (a common alternative is sketched after this list).
  • Fully Connected Layers: These layers gradually increase the dimensionality of the data to match the original input.
  • Activation Functions: Non-linear activation functions help the decoder reconstruct complex patterns.
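
The implementation later in this article upsamples with UpSampling2D followed by a regular convolution rather than a true transposed convolution; both approaches are common in practice. A minimal sketch of the two options (layer definitions only; the filter counts are illustrative):

from tensorflow.keras import layers

# Option 1: a true transposed convolution, as described in the list above
up_transposed = layers.Conv2DTranspose(64, (3, 3), strides=2,
                                       padding='same', activation='relu')

# Option 2: nearest-neighbour upsampling followed by a regular convolution,
# which is what the implementation later in this article uses
up_resize = layers.UpSampling2D((2, 2))
conv_after = layers.Conv2D(64, (3, 3), padding='same', activation='relu')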

5. Output Layer

The output layer of the decoder produces the final reconstructed version of the input data. For image data, this often involves a sigmoid or tanh activation function to ensure the output values are within a valid range (e.g., [0, 1] for normalized pixel values).

Role of Masking in Learning Representations in Autoencoders

Masking introduces an element of noise or missing information in the input data, forcing the autoencoder to learn robust and meaningful representations to reconstruct the original data. By learning to infer the missing parts, the autoencoder becomes better at generalizing and capturing the underlying patterns in the data. This improves its ability to handle incomplete or noisy inputs in real-world applications.

Visual Representation of Architecture

Here's a simple visual representation of the masked autoencoder architecture:

Input Image (with Masking) -> Encoder -> Latent Space -> Decoder -> Reconstructed Image

Each component in this pipeline plays a crucial role in learning and reconstructing the data. The encoder compresses the masked input, the latent space captures essential features, and the decoder reconstructs the original data from this compact representation.

Key Points:

  • Masking Mechanism: Introduces missing information in the input.
  • Encoder: Compresses the input into a latent representation.
  • Latent Space: Holds the essential features of the input.
  • Decoder: Reconstructs the original input from the latent space.
  • Output Layer: Produces the final reconstructed image.

By combining these elements, masked autoencoders effectively learn to reconstruct missing or corrupted parts of the input data, making them powerful tools for various applications in deep learning.

Mechanism of Masked Autoencoders

1. Masking Strategy

There are two common masking strategies:

  1. Random Masking: In this strategy, random elements of the input data are masked. This can be written as \tilde{x} = M \odot x, where \tilde{x} is the masked input, M is a binary mask matrix (with elements 0 or 1), and \odot denotes element-wise multiplication. The mask M is generated randomly, masking a fixed percentage of the input.
  2. Structured Masking: In structured masking, specific patterns or blocks of the input data are masked. This is useful in scenarios where the data has inherent structure, such as images or sequences. The mask M can be designed to hide contiguous blocks or specific regions of the input. Both strategies are sketched below.
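
The difference between the two strategies can be sketched in a few lines of NumPy; the 50% ratio and the 4x4 block below are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
x = rng.random((8, 8))                 # toy 8x8 "image"

# Random masking: every element is dropped independently
M_random = (rng.random((8, 8)) > 0.5).astype(x.dtype)
x_random = M_random * x                # element-wise product M ⊙ x

# Structured masking: a contiguous block is hidden instead
M_block = np.ones_like(x)
M_block[2:6, 2:6] = 0.0                # mask a 4x4 region
x_block = M_block * x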

Impact of Different Masking Strategies on Learning

  • Random Masking: Encourages the model to learn general features that are distributed across the input. This can improve robustness but may lead to less efficient learning if important features are frequently masked.
  • Structured Masking: Forces the model to understand and reconstruct specific patterns or regions, which can be beneficial for tasks with spatial or sequential dependencies. However, it might lead to overfitting to certain patterns if not applied carefully.

2. Encoding Phase

In the encoding phase, the masked input \tilde{x} is processed to obtain a latent representation z. The encoder function f maps the masked input to the latent space:

z = f(\tilde{x})

Here, the encoder is typically a neural network that processes the masked input and captures essential features in the latent representation.

The goal of the encoder is to learn robust representations that can capture the underlying structure of the input data, even when parts of it are missing. The robustness is achieved by training the model to minimize the reconstruction error, ensuring that the latent representation z contains enough information to accurately reconstruct the original input x.

3. Decoding Phase

Reconstruction from Masked Latent Representations

In the decoding phase, the latent representation z is used to reconstruct the original input. The decoder function g maps the latent representation back to the input space:

\hat{x} = g(z)

where \hat{x} is the reconstructed input. The decoder is designed to fill in the missing parts of the input based on the learned representation.

Handling Missing Information

The reconstruction process involves generating the missing information that was masked during the input phase. The effectiveness of this process depends on the quality of the latent representation z learned by the encoder. The reconstruction error L is typically measured using a loss function such as Mean Squared Error (MSE):

L(x, \hat{x}) = \| x - \hat{x} \|^2

The model is trained to minimize this error, ensuring that the reconstructed input \hat{x} is as close as possible to the original input x.
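
As a quick sanity check, the snippet below evaluates this loss by hand in TensorFlow. Note that Keras' built-in 'mse' loss (used in the implementation below) averages the squared differences over all elements, which differs from the plain squared norm above only by a constant factor:

import tensorflow as tf

x = tf.constant([[0.0, 1.0, 0.5]])
x_hat = tf.constant([[0.1, 0.8, 0.5]])

# Mean of the squared differences over all elements
mse = tf.reduce_mean(tf.square(x - x_hat))
print(mse.numpy())  # (0.1**2 + 0.2**2 + 0.0**2) / 3 ≈ 0.01667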

Implementation of Masked Autoencoders

Step 1: Import Necessary Libraries

This step imports the required libraries for building and training the masked autoencoder.

import tensorflow as tf
from tensorflow.keras import layers, models
import numpy as np
import matplotlib.pyplot as plt

Step 2: Load and Preprocess the MNIST Dataset

The MNIST dataset is loaded, normalized, and reshaped to include a channel dimension.

# Load the MNIST dataset
(x_train, _), (x_test, _) = tf.keras.datasets.mnist.load_data()
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0

# Add a channel dimension to the images
x_train = np.expand_dims(x_train, axis=-1)
x_test = np.expand_dims(x_test, axis=-1)

Step 3: Define the Masking Layer

This custom Keras layer applies random masking to the input data based on a specified mask ratio.

class MaskingLayer(layers.Layer):
    def __init__(self, mask_ratio=0.5, **kwargs):
        super(MaskingLayer, self).__init__(**kwargs)
        self.mask_ratio = mask_ratio

    def call(self, inputs):
        # Create a random mask: each element is kept only if a uniform
        # draw exceeds the mask ratio, otherwise it is zeroed out
        mask = tf.random.uniform(shape=tf.shape(inputs)) > self.mask_ratio
        return tf.where(mask, inputs, tf.zeros_like(inputs))

Step 4: Apply the Masking Layer and Visualize the Result

The masking layer is applied to a sample of the training data, and the original and masked images are plotted.

# Example of using the MaskingLayer
masking_layer = MaskingLayer(mask_ratio=0.5)
masked_x_train = masking_layer(x_train[:5])

# Plot original and masked images
plt.figure(figsize=(10, 5))
for i in range(5):
    plt.subplot(2, 5, i + 1)
    plt.imshow(x_train[i].squeeze(), cmap='gray')
    plt.axis('off')

    plt.subplot(2, 5, i + 6)
    plt.imshow(masked_x_train[i].numpy().squeeze(), cmap='gray')
    plt.axis('off')

plt.show()

Step 5: Build the Encoder

The encoder network is built using convolutional layers to compress the input data into a latent representation.

# Encoder
def build_encoder(input_shape):
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, (3, 3), activation='relu', padding='same')(inputs)
    x = layers.MaxPooling2D((2, 2), padding='same')(x)
    x = layers.Conv2D(64, (3, 3), activation='relu', padding='same')(x)
    x = layers.MaxPooling2D((2, 2), padding='same')(x)
    latent = layers.Conv2D(128, (3, 3), activation='relu', padding='same')(x)
    return models.Model(inputs, latent, name='encoder')

Step 6: Build the Decoder

The decoder network is built using deconvolutional (upsampling) layers to reconstruct the input data from the latent representation.

# Decoder
def build_decoder(latent_shape):
    inputs = layers.Input(shape=latent_shape)
    x = layers.Conv2D(128, (3, 3), activation='relu', padding='same')(inputs)
    x = layers.UpSampling2D((2, 2))(x)
    x = layers.Conv2D(64, (3, 3), activation='relu', padding='same')(x)
    x = layers.UpSampling2D((2, 2))(x)
    x = layers.Conv2D(32, (3, 3), activation='relu', padding='same')(x)
    outputs = layers.Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)
    return models.Model(inputs, outputs, name='decoder')

Step 7: Initialize Encoder and Decoder

The encoder and decoder models are instantiated.

input_shape = x_train.shape[1:]
encoder = build_encoder(input_shape)
latent_shape = encoder.output_shape[1:]
decoder = build_decoder(latent_shape)

Step 8: Build the Masked Autoencoder Model

The masked autoencoder model is created by combining the masking layer, encoder, and decoder.

# Masked Autoencoder
inputs = layers.Input(shape=input_shape)
masked_inputs = MaskingLayer(mask_ratio=0.5)(inputs)
latent = encoder(masked_inputs)
outputs = decoder(latent)

autoencoder = models.Model(inputs, outputs, name='masked_autoencoder')
autoencoder.compile(optimizer='adam', loss='mse')

Step 9: Train the Masked Autoencoder

The autoencoder model is trained using the MNIST training data.

# Train the model
autoencoder.fit(x_train, x_train, epochs=10, batch_size=128, validation_data=(x_test, x_test))

Step 10: Encode and Decode Images

The trained encoder and decoder are used to encode and decode a sample of the masked training data.

# Encode and decode some images
encoded_imgs = encoder.predict(masked_x_train)
decoded_imgs = decoder.predict(encoded_imgs)

Step 11: Visualize the Original, Masked, and Reconstructed Images

The original, masked, and reconstructed images are plotted for comparison.

# Plot original, masked, and reconstructed images
plt.figure(figsize=(15, 5))
for i in range(5):
    # Original images
    plt.subplot(3, 5, i + 1)
    plt.imshow(x_train[i].squeeze(), cmap='gray')
    plt.axis('off')

    # Masked images
    plt.subplot(3, 5, i + 6)
    plt.imshow(masked_x_train[i].numpy().squeeze(), cmap='gray')
    plt.axis('off')

    # Reconstructed images
    plt.subplot(3, 5, i + 11)
    plt.imshow(decoded_imgs[i].squeeze(), cmap='gray')
    plt.axis('off')

plt.show()

Complete Code

Python
import tensorflow as tf
from tensorflow.keras import layers, models
import numpy as np
import matplotlib.pyplot as plt

# Load the MNIST dataset
(x_train, _), (x_test, _) = tf.keras.datasets.mnist.load_data()
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0

# Add a channel dimension to the images
x_train = np.expand_dims(x_train, axis=-1)
x_test = np.expand_dims(x_test, axis=-1)

class MaskingLayer(layers.Layer):
    def __init__(self, mask_ratio=0.5, **kwargs):
        super(MaskingLayer, self).__init__(**kwargs)
        self.mask_ratio = mask_ratio

    def call(self, inputs):
        # Create a random mask
        mask = tf.random.uniform(shape=tf.shape(inputs)) > self.mask_ratio
        return tf.where(mask, inputs, tf.zeros_like(inputs))

# Example of using the MaskingLayer
masking_layer = MaskingLayer(mask_ratio=0.5)
masked_x_train = masking_layer(x_train[:5])

# Plot original and masked images
plt.figure(figsize=(10, 5))
for i in range(5):
    plt.subplot(2, 5, i + 1)
    plt.imshow(x_train[i].squeeze(), cmap='gray')
    plt.axis('off')

    plt.subplot(2, 5, i + 6)
    plt.imshow(masked_x_train[i].numpy().squeeze(), cmap='gray')
    plt.axis('off')

plt.show()

# Encoder
def build_encoder(input_shape):
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, (3, 3), activation='relu', padding='same')(inputs)
    x = layers.MaxPooling2D((2, 2), padding='same')(x)
    x = layers.Conv2D(64, (3, 3), activation='relu', padding='same')(x)
    x = layers.MaxPooling2D((2, 2), padding='same')(x)
    latent = layers.Conv2D(128, (3, 3), activation='relu', padding='same')(x)
    return models.Model(inputs, latent, name='encoder')

# Decoder
def build_decoder(latent_shape):
    inputs = layers.Input(shape=latent_shape)
    x = layers.Conv2D(128, (3, 3), activation='relu', padding='same')(inputs)
    x = layers.UpSampling2D((2, 2))(x)
    x = layers.Conv2D(64, (3, 3), activation='relu', padding='same')(x)
    x = layers.UpSampling2D((2, 2))(x)
    x = layers.Conv2D(32, (3, 3), activation='relu', padding='same')(x)
    outputs = layers.Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)
    return models.Model(inputs, outputs, name='decoder')

input_shape = x_train.shape[1:]
encoder = build_encoder(input_shape)
latent_shape = encoder.output_shape[1:]
decoder = build_decoder(latent_shape)

# Masked Autoencoder
inputs = layers.Input(shape=input_shape)
masked_inputs = MaskingLayer(mask_ratio=0.5)(inputs)
latent = encoder(masked_inputs)
outputs = decoder(latent)

autoencoder = models.Model(inputs, outputs, name='masked_autoencoder')
autoencoder.compile(optimizer='adam', loss='mse')

# Train the model
autoencoder.fit(x_train, x_train, epochs=10, batch_size=128, validation_data=(x_test, x_test))

# Encode and decode some images
encoded_imgs = encoder.predict(masked_x_train)
decoded_imgs = decoder.predict(encoded_imgs)

# Plot original, masked, and reconstructed images
plt.figure(figsize=(15, 5))
for i in range(5):
    # Original images
    plt.subplot(3, 5, i + 1)
    plt.imshow(x_train[i].squeeze(), cmap='gray')
    plt.axis('off')

    # Masked images
    plt.subplot(3, 5, i + 6)
    plt.imshow(masked_x_train[i].numpy().squeeze(), cmap='gray')
    plt.axis('off')

    # Reconstructed images
    plt.subplot(3, 5, i + 11)
    plt.imshow(decoded_imgs[i].squeeze(), cmap='gray')
    plt.axis('off')

plt.show()

Output:

Epoch 1/10
469/469 [==============================] - 508s 1s/step - loss: 0.1127 - val_loss: 0.1140
Epoch 2/10
469/469 [==============================] - 488s 1s/step - loss: 0.1120 - val_loss: 0.1140
Epoch 3/10
469/469 [==============================] - 487s 1s/step - loss: 0.1120 - val_loss: 0.1140
Epoch 4/10
469/469 [==============================] - 479s 1s/step - loss: 0.1120 - val_loss: 0.1140
Epoch 5/10
469/469 [==============================] - 482s 1s/step - loss: 0.1120 - val_loss: 0.1140
Epoch 6/10
469/469 [==============================] - 479s 1s/step - loss: 0.1120 - val_loss: 0.1140
Epoch 7/10
469/469 [==============================] - 480s 1s/step - loss: 0.1120 - val_loss: 0.1140
Epoch 8/10
469/469 [==============================] - 482s 1s/step - loss: 0.1120 - val_loss: 0.1140
Epoch 9/10
469/469 [==============================] - 491s 1s/step - loss: 0.1120 - val_loss: 0.1140
Epoch 10/10
469/469 [==============================] - 481s 1s/step - loss: 0.1120 - val_loss: 0.1140
The first row shows the original images, the second row the masked images, and the third row the reconstructed images.

Note: Training for more epochs may improve the quality of the reconstructed images.
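
For example, a longer run can be combined with Keras' EarlyStopping callback so that training stops once the validation loss stops improving; the epoch count and patience below are illustrative choices:

from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=3,
                           restore_best_weights=True)
autoencoder.fit(x_train, x_train, epochs=50, batch_size=128,
                validation_data=(x_test, x_test),
                callbacks=[early_stop])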

Conclusion

Masked autoencoders, through their unique approach of reconstructing data from masked inputs, enhance model robustness and generalization. They are crucial in advancing deep learning applications such as data denoising and anomaly detection, making them indispensable in the field.

