U-Net Architecture Explained

Last Updated : 03 Jun, 2025

U-Net is a type of neural network used mainly for image segmentation, i.e. dividing an image into regions to identify specific objects, for example separating a tumor from healthy tissue in a medical scan. The name “U-Net” comes from the shape of its architecture, which looks like the letter “U” when drawn. It is widely used in medical imaging because it performs well even with a small amount of labeled data.

U-Net Architecture

The architecture is symmetric and has three key parts:

Contracting Path (Encoder):

  • Uses small 3×3 convolution filters to scan the image and extract features.
  • Applies the ReLU activation function to add non-linearity, which helps the model learn better.
  • Uses 2×2 max pooling to shrink the feature maps while keeping the important information, helping the network focus on larger-scale features.

Bottleneck:

The middle of the “U” where the most compressed and abstract information is stored. It links the encoder and decoder.

Expansive Path (Decoder):

  • Uses upsampling, i.e. increasing the feature-map size, to work back toward the original image resolution.
  • Combines information from the encoder using “skip connections”. These connections give the decoder spatial details that would otherwise be lost during downsampling.
  • Applies convolution layers again to clean up and refine the output.
U-Net Architecture

The image above shows U-Net turning a 572×572 input into a smaller 388×388 segmentation map. The network shrinks the image to capture features, then upsamples to restore size, using skip connections to preserve detail. The output labels each pixel as object or background.
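The 572→388 shrinkage comes entirely from the unpadded 3×3 convolutions (each trims a 1-pixel border on every side) and the 2×2 pooling and up-convolution steps. Below is a small sketch, not part of the model code, that traces the spatial size through the paper's layer sequence:

Python
def trace_unet_sizes(size=572, depth=4):
    """Trace spatial size through a U-Net built from valid 3x3 convolutions."""
    skip_sizes = []
    for _ in range(depth):       # contracting path
        size -= 4                # two 3x3 valid convs, each trims 2 pixels
        skip_sizes.append(size)  # pre-pool size, reused by the skip connection
        size //= 2               # 2x2 max pooling halves the size
    size -= 4                    # bottleneck convolutions
    for _ in range(depth):       # expansive path
        size = size * 2 - 4      # 2x2 up-convolution, then two valid convs
    return skip_sizes, size

skips, out = trace_unet_sizes(572)
print(skips, out)   # [568, 280, 136, 64] 388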

How U-Net Works

After understanding the architecture, it’s important to see how U-Net actually processes data to perform segmentation:

  1. Input Image: The process starts by feeding an input image, typically a grayscale medical image, into the network.
  2. Feature Extraction (Encoder): The encoder extracts increasingly abstract features by applying convolutions and downsampling. At each level the spatial size decreases while the number of feature channels increases, allowing the model to capture higher-level patterns.
  3. Bottleneck Processing: This is the middle part of the network where the image is reduced the most. It holds a small but very meaningful version of the image that captures the main features.
  4. Reconstruction and Localization (Decoder): The decoder begins to reconstruct the original image size through upsampling. At each level it combines decoder features with corresponding encoder features using skip connections to retain fine-grained spatial details.
  5. Skip Connections for Precision: Skip connections help preserve spatial accuracy by bringing forward detailed features from earlier layers. These are especially useful when the model needs to distinguish boundaries in segmentation tasks.
  6. Final Prediction: A 1×1 convolution at the end converts the refined feature maps into the final segmentation map, classifying each pixel into a class such as foreground or background (a small sketch of this step follows the list). With "same" padding the output matches the input resolution; with the unpadded convolutions of the original paper it is slightly smaller, e.g. 388×388 for a 572×572 input.
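To make the final step concrete, here is a minimal sketch using a made-up 4-channel feature map: a 1×1 convolution acts on each pixel independently, mapping its feature vector to class scores without changing the spatial size.

Python
import tensorflow as tf

# Hypothetical refined decoder output: one 388x388 feature map with 4 channels
features = tf.random.normal((1, 388, 388, 4))

# 1x1 convolution: 4 feature channels -> 2 class channels, per pixel
logits = tf.keras.layers.Conv2D(filters=2, kernel_size=1)(features)
print(logits.shape)   # (1, 388, 388, 2) -- spatial size unchanged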

Implementation of U-Net

Now we will implement the U-Net architecture using Python 3 and the TensorFlow library. The implementation consists of three main parts:

  1. Encoder Block: The contraction path block containing two 3x3 convolutional layers with ReLU activations, followed by a 2x2 max pooling layer. It returns both the pre-pooling features (for the skip connection) and the pooled output.
  2. Decoder Block: The expansive path block which upsamples the input, concatenates it with the corresponding encoder features and applies two 3x3 convolutional layers with ReLU activations.
  3. U-Net Model: Combining the encoder and decoder blocks to define the complete U-Net architecture.

1. Encoder

The encoder is responsible for extracting features from the input image. It applies two convolutional layers, each followed by a ReLU activation, to learn patterns, then uses max pooling to reduce the spatial size and help the model focus on the important features. The pre-pooling feature map is returned as well, so the decoder can reuse it through a skip connection.

Python
import tensorflow as tf

def encoder_block(inputs, num_filters):
    # Two 3x3 valid convolutions with ReLU extract features at this scale
    x = tf.keras.layers.Conv2D(num_filters, 3, padding='valid')(inputs)
    x = tf.keras.layers.Activation('relu')(x)
    x = tf.keras.layers.Conv2D(num_filters, 3, padding='valid')(x)
    x = tf.keras.layers.Activation('relu')(x)

    # 2x2 max pooling halves the spatial size for the next level
    p = tf.keras.layers.MaxPool2D(pool_size=(2, 2), strides=2)(x)

    # Return the pre-pool features (skip connection) and the pooled output
    return x, p
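As a quick sanity check (a usage sketch, not part of the model itself), tracing the shapes for a 572×572 RGB input shows the two valid convolutions trimming 572 to 568 and the pooling halving it to 284:

Python
dummy = tf.keras.layers.Input(shape=(572, 572, 3))
skip, pooled = encoder_block(dummy, 64)
print(skip.shape, pooled.shape)   # (None, 568, 568, 64) (None, 284, 284, 64)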

2. Decoder

The decoder restores the original image size while combining low-level and high-level features. It upsamples the feature map, resizes the corresponding encoder output (the skip connection) to match, merges the two and then applies two convolution layers with ReLU.

Python
def decoder_block(inputs, skip_features, num_filters):
    # 2x2 transposed convolution doubles the spatial size
    x = tf.keras.layers.Conv2DTranspose(num_filters, (2, 2), strides=2, padding='valid')(inputs)

    # Resize the encoder features to match (the original paper center-crops instead)
    skip_features = tf.keras.layers.Resizing(x.shape[1], x.shape[2])(skip_features)

    # Merge the skip connection with the upsampled features along the channel axis
    x = tf.keras.layers.Concatenate()([x, skip_features])

    # Two 3x3 valid convolutions with ReLU refine the merged features
    x = tf.keras.layers.Conv2D(num_filters, 3, padding='valid')(x)
    x = tf.keras.layers.Activation('relu')(x)
    x = tf.keras.layers.Conv2D(num_filters, 3, padding='valid')(x)
    x = tf.keras.layers.Activation('relu')(x)

    return x
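A similar sanity check for the decoder, with made-up bottleneck and skip shapes: upsampling a 28×28 map to 56×56, resizing the 64×64 skip tensor to match and applying the two valid convolutions yields a 52×52 output:

Python
bottleneck = tf.keras.layers.Input(shape=(28, 28, 1024))
skip = tf.keras.layers.Input(shape=(64, 64, 512))
out = decoder_block(bottleneck, skip, 512)
print(out.shape)   # (None, 52, 52, 512): 28 -> 56 (up-conv) -> 54 -> 52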

3. Defining the U-Net Model

This function builds the complete U-Net architecture by chaining four encoder blocks, a bottleneck and four decoder blocks. The final 1×1 output layer uses a sigmoid activation, which suits binary masks; softmax is the usual choice for mutually exclusive multi-class segmentation.

Python
def unet_model(input_shape=(256, 256, 3), num_classes=1):
    inputs = tf.keras.layers.Input(shape=input_shape)

    # Contracting Path (Encoder): s* are skip features, p* are pooled outputs
    s1, p1 = encoder_block(inputs, 64)
    s2, p2 = encoder_block(p1, 128)
    s3, p3 = encoder_block(p2, 256)
    s4, p4 = encoder_block(p3, 512)

    # Bottleneck: the most compressed, most abstract representation
    b1 = tf.keras.layers.Conv2D(1024, 3, padding='valid')(p4)
    b1 = tf.keras.layers.Activation('relu')(b1)
    b1 = tf.keras.layers.Conv2D(1024, 3, padding='valid')(b1)
    b1 = tf.keras.layers.Activation('relu')(b1)

    # Expansive Path (Decoder): mirrors the encoder, reusing the skip features
    d1 = decoder_block(b1, s4, 512)
    d2 = decoder_block(d1, s3, 256)
    d3 = decoder_block(d2, s2, 128)
    d4 = decoder_block(d3, s1, 64)

    # 1x1 convolution maps each pixel's features to class scores
    outputs = tf.keras.layers.Conv2D(num_classes, 1, padding='valid', activation='sigmoid')(d4)

    model = tf.keras.models.Model(inputs=inputs, outputs=outputs, name='U-Net')
    return model

if __name__ == '__main__':
    model = unet_model(input_shape=(572, 572, 3), num_classes=2)
    model.summary()

Output:

U-Net model summary

4. Applying the Model to an Image

Below is an example that loads an image, preprocesses it, runs it through the U-Net model and saves the predicted segmentation mask. Any RGB image saved as cat.png will work.

Python
import numpy as np
from PIL import Image

# Load the image and match the model's expected input size
img = Image.open('cat.png').convert('RGB')
img = img.resize((572, 572))

# Scale pixel values to [0, 1] and add a batch dimension
img_array = np.asarray(img, dtype=np.float32) / 255.0
img_array = np.expand_dims(img_array, axis=0)

model = unet_model(input_shape=(572, 572, 3), num_classes=2)

predictions = model.predict(img_array)

# Drop the batch dimension, pick the most likely class per pixel,
# and map class indices {0, 1} to pixel values {0, 255}
pred_mask = np.squeeze(predictions, axis=0)
pred_mask = np.argmax(pred_mask, axis=-1).astype(np.uint8) * 255

# The valid convolutions make the mask 388x388; resize back to the input size
pred_mask_img = Image.fromarray(pred_mask)
pred_mask_img = pred_mask_img.resize(img.size)

pred_mask_img.save('predicted_image.jpg')
pred_mask_img.show()

Output:

1/1 [==============================] - 2s 2s/step

Predicted Image
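Keep in mind that the model above has randomly initialized weights, so the predicted mask only shows that the pipeline runs end to end. To obtain meaningful segmentations the network must first be trained. A minimal training sketch follows, where train_images and train_masks are hypothetical arrays you would replace with your own dataset (masks must match the model's 388×388 output resolution and have num_classes channels):

Python
model.compile(optimizer='adam',
              loss='binary_crossentropy',  # matches the sigmoid output layer
              metrics=['accuracy'])

# Placeholder data: train_images of shape (N, 572, 572, 3),
# train_masks of shape (N, 388, 388, 2) -- substitute a real dataset
model.fit(train_images, train_masks, batch_size=2, epochs=20)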

Since the network here is untrained, the mask only confirms that the model builds and runs correctly; after training it learns to draw accurate boundaries around objects such as the cat. Beyond medical imaging, U-Net is flexible and is used in many areas such as image denoising, image-to-image translation, image enhancement and object detection.

