U-Net Architecture Explained
Last Updated: 03 Jun, 2025
U-Net is a convolutional neural network used mainly for image segmentation, which means dividing an image into regions so that specific objects can be identified, for example separating a tumor from healthy tissue in a medical scan. The name “U-Net” comes from the shape of its architecture, which looks like the letter “U” when drawn. It is widely used in medical imaging because it performs well even with a small amount of labeled data.
U-Net Architecture
The architecture is symmetric and has three key parts:
Contracting Path (Encoder):
- Uses small filters (3×3 pixels) to scan the image and find features.
- Applies the ReLU activation function after each convolution to add non-linearity, which helps the model learn more complex patterns.
- Uses max pooling (2×2 windows) to shrink the feature map while keeping the strongest responses. This helps the network focus on larger-scale features.
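As a rough illustration of the last two steps, the sketch below applies ReLU and 2×2 max pooling to a small single-channel feature map with NumPy. The helper names `relu` and `max_pool_2x2` and the sample values are made up for this example; they are not part of any library.

```python
import numpy as np

def relu(x):
    # Replace negative activations with zero.
    return np.maximum(x, 0)

def max_pool_2x2(x):
    # Non-overlapping 2x2 max pooling (stride 2) on a 2D map.
    # A trailing odd row or column is dropped.
    h, w = x.shape
    x = x[: h - h % 2, : w - w % 2]
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

feature_map = np.array([[ 1, -2,  3,  0],
                        [ 4,  5, -6,  7],
                        [ 0,  1,  2,  3],
                        [-1, -2, -3, -4]])

pooled = max_pool_2x2(relu(feature_map))
print(pooled.shape)  # (2, 2)
```

Each 2×2 window is reduced to its largest value, so the map loses half of its height and width while the strongest activations survive.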
Bottleneck:
The middle of the “U” where the most compressed and abstract information is stored. It links the encoder and decoder.
Expansive Path (Decoder):
- Uses upsampling (increasing the feature map size) to move back toward the original image resolution.
- Combines information from the encoder using “skip connections.” These connections help the decoder get spatial details that might have been lost when shrinking the image.
- Uses convolution layers again to clean up and refine the output.
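A skip connection can be sketched in NumPy as "crop, then concatenate along the channel axis". This is only an illustration under assumptions: the helper `center_crop` is hypothetical, the array sizes are chosen for the example, and center-cropping is the approach used by unpadded U-Net variants (the TensorFlow code later in this article resizes the encoder features instead).

```python
import numpy as np

def center_crop(features, target_h, target_w):
    # features has shape (height, width, channels).
    h, w, _ = features.shape
    top = (h - target_h) // 2
    left = (w - target_w) // 2
    return features[top:top + target_h, left:left + target_w, :]

# Illustrative shapes: encoder map is larger than the upsampled decoder map.
encoder_features = np.random.rand(64, 64, 512)
decoder_features = np.random.rand(56, 56, 512)

cropped = center_crop(encoder_features, 56, 56)
merged = np.concatenate([decoder_features, cropped], axis=-1)
print(merged.shape)  # (56, 56, 1024)
```

After concatenation the decoder sees both its own upsampled features and the finer encoder features at every pixel, which is why the channel count doubles.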
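[Figure: U-Net Architecture]
The figure above shows U-Net turning a 572×572 input image into a smaller 388×388 segmentation map. The output is smaller than the input because the convolutions are unpadded, so each one trims a border of pixels. The network shrinks the image to capture features, then upsamples to restore resolution, using skip connections to keep details. The output labels each pixel as object or background.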
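The 572 to 388 size change can be checked with simple arithmetic. The sketch below assumes four encoder and four decoder levels, two unpadded 3×3 convolutions per level (each removes 2 pixels per dimension), 2×2 pooling that halves the size and 2×2 upsampling that doubles it. The function name `unet_output_size` is made up for this illustration.

```python
def unet_output_size(input_size, depth=4):
    size = input_size
    # Contracting path
    for _ in range(depth):
        size -= 4              # two unpadded 3x3 convolutions
        if size % 2 != 0:
            raise ValueError("feature map size must be even before pooling")
        size //= 2             # 2x2 max pooling
    # Bottleneck
    size -= 4                  # two unpadded 3x3 convolutions
    # Expansive path
    for _ in range(depth):
        size *= 2              # 2x2 upsampling
        size -= 4              # two unpadded 3x3 convolutions
    return size

print(unet_output_size(572))  # 388
```

The sketch rejects input sizes for which a pooling step would see an odd dimension; this is a choice made here to keep the arithmetic exact.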
How U-Net Works
After understanding the architecture, it’s important to see how U-Net actually processes data to perform segmentation:
- Input Image: The process starts by feeding an input image, often a grayscale medical scan, into the network.
- Feature Extraction (Encoder): The encoder extracts increasingly abstract features by applying convolutions and downsampling. At each level the spatial size decreases while the number of feature channels increases, which allows the model to capture higher-level patterns.
- Bottleneck Processing: This is the middle part of the network where the image is reduced the most. It holds a small but very meaningful version of the image that captures the main features.
- Reconstruction and Localization (Decoder): The decoder begins to reconstruct the original image size through upsampling. At each level it combines decoder features with corresponding encoder features using skip connections to retain fine-grained spatial details.
- Skip Connections for Precision: Skip connections help preserve spatial accuracy by bringing forward detailed features from earlier layers. These are especially useful when the model needs to distinguish boundaries in segmentation tasks.
- Final Prediction: A 1×1 convolution at the end converts the refined feature maps into the final segmentation map, where each pixel is assigned a class such as foreground or background. With padded ("same") convolutions this output has the same spatial resolution as the input image; with unpadded convolutions, as in the 572×572 to 388×388 example above, it is smaller.
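The final prediction step can be pictured as the same small linear classifier applied at every pixel. The NumPy sketch below is an illustration only: the feature map, weights and sizes are random placeholders, not values from a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.standard_normal((4, 4, 64))   # (height, width, channels)
weights = rng.standard_normal((64, 2))       # one column per class
bias = np.zeros(2)

# A 1x1 convolution applies the same linear map to each pixel's channel vector.
logits = features @ weights + bias           # shape (4, 4, 2)

# Assign each pixel the class with the highest score.
label_map = logits.argmax(axis=-1)           # shape (4, 4)
print(logits.shape, label_map.shape)
```

Because the kernel covers a single pixel, the spatial size is unchanged and only the channel dimension is mapped from 64 features to 2 class scores.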
Implementation of U-Net
Now we will implement the U-Net architecture using Python 3 and the TensorFlow library. The implementation consists of three main parts:
- Encoder Block: The contraction path block containing two 3x3 convolutional layers with ReLU activations, followed by a 2x2 max pooling layer.
- Decoder Block: The expansive path block which upsamples the input, concatenates it with the corresponding encoder features and applies two 3x3 convolutional layers with ReLU activations.
- U-Net Model: Combining the encoder and decoder blocks to define the complete U-Net architecture.
1. Encoder
The encoder is responsible for extracting features from the input image. It applies two convolutional layers, each followed by a ReLU activation, to learn patterns, and then uses max pooling to reduce the feature map size, which helps the model focus on important features.
Python
import tensorflow as tf

def encoder_block(inputs, num_filters):
    # Two unpadded 3x3 convolutions, each followed by ReLU
    x = tf.keras.layers.Conv2D(num_filters, 3, padding='valid')(inputs)
    x = tf.keras.layers.Activation('relu')(x)
    x = tf.keras.layers.Conv2D(num_filters, 3, padding='valid')(x)
    x = tf.keras.layers.Activation('relu')(x)
    # 2x2 max pooling halves the height and width
    x = tf.keras.layers.MaxPool2D(pool_size=(2, 2), strides=2)(x)
    return x
2. Decoder
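The decoder helps restore the original image size while combining the low-level and high-level features. It starts by upsampling the feature map, resizes the corresponding encoder output (skip connection) to the same height and width, merges them and then applies two convolution layers with ReLU. Note that this implementation resizes the already pooled encoder output, whereas the original U-Net design center-crops the encoder features taken before pooling.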
Python
def decoder_block(inputs, skip_features, num_filters):
    # 2x2 transposed convolution doubles the height and width
    x = tf.keras.layers.Conv2DTranspose(num_filters, (2, 2), strides=2, padding='valid')(inputs)
    # Resize the encoder features to match the upsampled map, then concatenate
    skip_features = tf.keras.layers.Resizing(x.shape[1], x.shape[2])(skip_features)
    x = tf.keras.layers.Concatenate()([x, skip_features])
    # Two unpadded 3x3 convolutions, each followed by ReLU
    x = tf.keras.layers.Conv2D(num_filters, 3, padding='valid')(x)
    x = tf.keras.layers.Activation('relu')(x)
    x = tf.keras.layers.Conv2D(num_filters, 3, padding='valid')(x)
    x = tf.keras.layers.Activation('relu')(x)
    return x
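To build some intuition for the upsampling step, the sketch below mimics a 2×2 transposed convolution with stride 2 for a single input and output channel and no bias. In that simplified case each input pixel is multiplied by the kernel and written into its own 2×2 output block, which is the Kronecker product of the input and the kernel. The function name `upsample_2x2_transposed` and the kernel values are assumptions for illustration; in the real layer the kernel is learned and contributions are summed over input channels.

```python
import numpy as np

def upsample_2x2_transposed(x, kernel):
    # x: (h, w), kernel: (2, 2) -> output: (2h, 2w)
    # Each input pixel scales the kernel and fills one 2x2 output block.
    return np.kron(x, kernel)

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
kernel = np.array([[1.0, 0.5],
                   [0.5, 0.25]])

up = upsample_2x2_transposed(x, kernel)
print(up.shape)  # (4, 4)
```

Because the stride equals the kernel size, the output blocks do not overlap, so the height and width are exactly doubled.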
3. Defining the U-Net Model
This function builds the complete U-Net architecture. It connects multiple encoder and decoder blocks and includes a bottleneck in the middle. The final output layer uses a sigmoid activation for segmentation.
Python
def unet_model(input_shape=(256, 256, 3), num_classes=1):
    inputs = tf.keras.layers.Input(shape=input_shape)

    # Contracting Path (Encoder): filters double at each level
    s1 = encoder_block(inputs, 64)
    s2 = encoder_block(s1, 128)
    s3 = encoder_block(s2, 256)
    s4 = encoder_block(s3, 512)

    # Bottleneck: two unpadded 3x3 convolutions with ReLU
    b1 = tf.keras.layers.Conv2D(1024, 3, padding='valid')(s4)
    b1 = tf.keras.layers.Activation('relu')(b1)
    b1 = tf.keras.layers.Conv2D(1024, 3, padding='valid')(b1)
    b1 = tf.keras.layers.Activation('relu')(b1)

    # Expansive Path (Decoder): filters halve at each level
    d1 = decoder_block(b1, s4, 512)
    d2 = decoder_block(d1, s3, 256)
    d3 = decoder_block(d2, s2, 128)
    d4 = decoder_block(d3, s1, 64)

    # 1x1 convolution maps features to per-pixel class scores.
    # Sigmoid scores each class independently; softmax is the usual
    # choice when num_classes > 1 and the classes are mutually exclusive.
    outputs = tf.keras.layers.Conv2D(num_classes, 1, padding='valid', activation='sigmoid')(d4)
    model = tf.keras.models.Model(inputs=inputs, outputs=outputs, name='U-Net')
    return model

if __name__ == '__main__':
    model = unet_model(input_shape=(572, 572, 3), num_classes=2)
    model.summary()
Output:
[Figure: U-Net Model]
4. Applying the Model to an Image
Below is an example that loads an image, preprocesses it, runs it through the U-Net model and saves the predicted segmentation mask. The code expects the input image to be available as cat.png in the working directory.
Python
import numpy as np
from PIL import Image

# Load the image and resize it to the network's input size
img = Image.open('cat.png').convert('RGB')
img = img.resize((572, 572))

# Scale pixel values to [0, 1] and add a batch dimension
img_array = np.asarray(img, dtype=np.float32) / 255.0
img_array = np.expand_dims(img_array, axis=0)

# Build the model (weights are randomly initialized; no training is done here)
model = unet_model(input_shape=(572, 572, 3), num_classes=2)
predictions = model.predict(img_array)

# Pick the highest-scoring class per pixel and map {0, 1} to {0, 255}
pred_mask = np.squeeze(predictions, axis=0)
pred_mask = np.argmax(pred_mask, axis=-1).astype(np.uint8) * 255

# Resize the smaller output mask back to the 572x572 input size and save it
pred_mask_img = Image.fromarray(pred_mask)
pred_mask_img = pred_mask_img.resize(img.size)
pred_mask_img.save('predicted_image.jpg')
pred_mask_img.show()
Output:
1/1 [==============================] - 2s 2s/step
[Figure: Predicted Image]
We can see that the model produces a mask that follows the outline of the cat, which shows that the pipeline works end to end. Because the model in this example is built with randomly initialized weights and is not trained, the mask should not be read as a learned segmentation; training on images with labeled masks is needed for that. U-Net is flexible and is used in many areas such as image cleaning, translation, enhancement, object detection and language tasks.