
CLOTHING TAG SUGGESTION &

TAG-BASED SEARCH SYSTEM


Subject: DBM302m
Supervisor: HaiNH51
Authors: Duy Hoang, Nguyen Tran, Duy Nguyen, Thanh Nguyen
Department of Information Science, FPT University, Ho Chi Minh City, Vietnam
Outline
1. Introduction
2. Related Work
a. Encoder (CNN) & Decoder (RNN/LSTM)
b. Spatial Attention & Dynamic Focus
3. Motivations & Solutions
a. Motivations
b. Our Solutions
4. Model Details
a. Summary & Explanation of Models
b. Proposed Model
5. Implementation Details
a. Training Pipeline
b. Configuration Explanation
6. Results
a. Evaluation Metrics
b. Results Insight
c. Visualization
7. Discussion
1. Introduction
Summary of our problem
We want to build a model that generates tags (attributes) from a clothing image.
Expected Input & Output:

Input (Image) → Output (Text)

Our goals:
Approach the problem in different ways: image captioning, multi-label classification, etc.
Create a minimal runnable model that can generate attributes.
Study and research current methods in depth.
2. Related Work
a. Encoder (CNN) & Decoder (RNN/LSTM):
Show and Tell was one of the first end-to-end neural network models for image captioning. Its main components are an encoder and a decoder.

Key Advantages:
It establishes a straightforward mapping from visual data to language.
The architecture is trained end-to-end, which simplifies the training process.
Limitations:
Because it relies on a single, global feature vector, it can sometimes miss out on important local details
in an image—details that might be crucial for describing nuanced aspects of clothing (like specific
patterns or textures).

Ref: Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and Tell: A Neural Image Caption Generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3156–3164.
2. Related Work
b. Spatial Attention & Dynamic Focus:
Show, Attend and Tell addressed the limitation of global representations by introducing an attention
mechanism:
Spatial Attention: attends over spatial feature maps from an intermediate CNN layer.
Dynamic Focus: dynamically generates a context vector for each word by taking a weighted sum of the
CNN feature map based on the computed attention weights.

Key Advantages:
It improves captioning performance by allowing the model to capture local details.
It provides interpretable attention maps that can sometimes highlight the regions corresponding to
particular words in the caption.

Ref: Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., & Bengio, Y. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proceedings of the International Conference on Machine Learning (ICML), pp. 2048–2057.
3. Motivations & Solutions
a. Motivations:
This problem can be solved in multiple ways:
Multi-label: Predict each attribute independently.
Image Captioning: Generate a natural-language caption for the image.
CLIP: Strong at vision-language understanding; learns image-text alignment by training on vast amounts of images paired with textual descriptions.
ResNet50+LSTM or EfficientNet+GRU: Combines a CNN for image feature extraction with an LSTM/GRU for sequential attribute prediction, potentially capturing dependencies between attributes.

Model | Image Feature Extractor | Text Processing | Key Strengths
CLIP | Vision Transformer (ViT) | Transformer | Strong image-text alignment
ResNet50+LSTM | ResNet50 | LSTM | Captures sequential dependencies in labels
EfficientNet+GRU | EfficientNet | GRU | More lightweight, better efficiency
3. Motivations & Solutions
b. Our solutions

CLIP
ResNet50 + LSTM
EfficientNet + GRU
4. Model Details
a. Summary & Explanation of Models (CLIP)
CLIP: overview
4. Model Details
a. Summary & Explanation of Models (CLIP)
CLIP: Encoder (ViT) - details

[Figure: Image encoder (ViT) and Text encoder (Transformer)]


4. Model Details
a. Summary & Explanation of Models (CLIP)
CLIP: Loss - Contrastive loss (InfoNCE)

In practice, this contrastive loss is not computed directly; it is implemented as a cross-entropy loss over the image-text similarity logits.

Insight:
Contrastive loss encourages a model to learn representations by drawing similar samples (such as an image and its augmented version) closer together while pushing dissimilar samples apart.
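To make the "cross-entropy over similarity logits" point concrete, here is a minimal sketch of a symmetric CLIP-style InfoNCE loss; the encoders are assumed to already produce batch-aligned embeddings, and the names `image_emb`/`text_emb` and the fixed temperature are illustrative (CLIP itself learns a temperature).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss for a batch of matching image/text pairs.

    image_emb, text_emb: (batch, dim) embeddings; row i of each is a matching pair.
    """
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature

    # The matching text for image i sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```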
4. Model Details
a. Summary & Explanation of Models (Encoder-Decoder)
Encoder-Decoder: ResNet50 + LSTM

[Figure: ResNet50 architecture and LSTM architecture]

4. Model Details
a. Summary & Explanation of Models (Encoder-Decoder)
Encoder-Decoder: EfficientNetB4 + GRU

[Figure: EfficientNetB4 architecture and GRU architecture]

4. Model Details
a. Summary & Explanation of Models (Encoder-Decoder)
Encoder-Decoder: EfficientNetB4 + GRU

Model | Depth Scale | Width Scale | Parameters (M)
B1 | 1.1× | 1.0× | 7.8M
B2 | 1.2× | 1.1× | 9.2M
B3 | 1.4× | 1.2× | 12M
B4 | 1.8× | 1.4× | 19M

Comparison of different versions of EfficientNet


4. Model Details
a. Summary & Explanation of Models (Encoder-Decoder)
Encoder-Decoder: BCEWithLogitsLoss

Insights:
BCEWithLogitsLoss combines the sigmoid activation function with binary cross-entropy (BCE) loss in a single step, improving numerical stability and training efficiency.
It avoids issues like vanishing or exploding gradients by handling extreme logit values better than applying sigmoid and BCE separately.
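A minimal PyTorch sketch of the same point: the fused call takes raw logits, so the sigmoid never has to be applied explicitly (the tensor shapes and random values are illustrative).

```python
import torch
import torch.nn as nn

# Raw model outputs (logits) for a batch of 4 images and 6 binary attributes.
logits = torch.randn(4, 6)
targets = torch.randint(0, 2, (4, 6)).float()  # multi-hot ground-truth labels

# Fused sigmoid + BCE: numerically stable via the log-sum-exp trick.
criterion = nn.BCEWithLogitsLoss()
loss = criterion(logits, targets)

# Equivalent but less stable two-step version (sigmoid first, then BCE).
unstable = nn.functional.binary_cross_entropy(torch.sigmoid(logits), targets)
print(loss.item(), unstable.item())  # values agree up to floating-point error
```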
4. Model Details
b. Proposed Model
5. Implementation Details
a. Training Pipeline (CLIP):

CLIP (Contrastive Language-Image Pretraining)


Pretrained on a massive dataset of image-text pairs to learn a shared representation
between images and text.
Vision Transformer (ViT-B/32) as the image encoder, producing a feature vector of size 512.
Text Transformer as the text encoder, outputting a 512-dimensional embedding for textual
descriptions.
Both image and text embeddings are normalized and projected into a shared latent space for
similarity comparison.
5. Implementation Details
a. Training Pipeline (CLIP):

1. Load Data (Train & Validation)
Splits the dataset into training and validation sets.
Applies the necessary transformations (e.g., resizing, normalization) based on CLIP's preprocessing pipeline.
Converts images and text into tokenized inputs for CLIP.
Loads data in batches for efficient processing.
2. Initialize Model (CLIP)
Uses a pre-trained CLIP model (ViT-B/32) to extract joint image-text representations.
Images are encoded using a Vision Transformer (ViT), while text is processed via a Transformer-based language model.
Computes similarity scores between image and text embeddings for classification or retrieval tasks.
3. Set Loss & Optimizer
Loss Function: Contrastive loss (InfoNCE) to maximize similarity between matching image-text pairs while minimizing mismatches.
Optimizer: AdamW (lr=1e-5) for stable and efficient fine-tuning of CLIP's parameters.
5. Implementation Details
a. Training Pipeline (CLIP):

4. Training Loop (Forward, Backward, Optimize): Train the model by performing forward and backward passes (a minimal sketch follows this list).
Process:
1. Forward Pass:
Pass images through CLIP's vision encoder to extract embeddings.
Pass text through CLIP's text encoder to extract embeddings.
Compute cosine similarity between image and text embeddings.
2. Compute Loss:
Applies contrastive loss (InfoNCE) to maximize correct pair similarity.
3. Backward Pass & Optimization:
Computes gradients (loss.backward()).
Updates model weights (optimizer.step()).
4. Track Training Loss:
Stores loss for later visualization and monitoring.
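A minimal sketch of this loop, assuming OpenAI's `clip` package and a dataloader that yields preprocessed image tensors with their tag strings; the dataloader (`train_loader`) and the tag format are assumptions, not the team's exact code.

```python
import torch
import clip  # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # pretrained ViT-B/32
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
ce_image = torch.nn.CrossEntropyLoss()
ce_text = torch.nn.CrossEntropyLoss()

train_losses = []
for images, captions in train_loader:            # hypothetical dataloader
    images = images.to(device)
    tokens = clip.tokenize(captions).to(device)  # tag strings -> token ids

    # Forward pass: CLIP returns scaled image-text similarity logits.
    logits_per_image, logits_per_text = model(images, tokens)
    targets = torch.arange(images.size(0), device=device)

    # Symmetric InfoNCE implemented as cross-entropy over the logits.
    loss = (ce_image(logits_per_image, targets) +
            ce_text(logits_per_text, targets)) / 2

    optimizer.zero_grad()
    loss.backward()        # compute gradients
    optimizer.step()       # update weights
    train_losses.append(loss.item())  # track loss for visualization
```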
5. Implementation Details
a. Training Pipeline (CLIP):

5. Validation (Evaluate Model): Measure model performance on unseen validation data (a sketch of this retrieval-style evaluation follows this list).
Extract text embeddings for all unique label descriptions.
Normalize text embeddings for similarity comparison.
Process test images to extract image embeddings.
Match each image embedding to the closest text embedding via cosine similarity.
Compute:
Overall accuracy (full-label match).
Per-attribute accuracy (individual components like looks, fit, colors, etc.).
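A sketch of this retrieval-style validation under the same assumptions as above; `label_texts` (the list of unique label descriptions) and `val_loader` (yielding images with the index of their true description) are illustrative placeholders.

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model.eval()

with torch.no_grad():
    # Encode every unique label description once and normalize.
    text_tokens = clip.tokenize(label_texts).to(device)   # hypothetical list of strings
    text_emb = model.encode_text(text_tokens)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    correct, total = 0, 0
    for images, label_idx in val_loader:                  # hypothetical dataloader
        image_emb = model.encode_image(images.to(device))
        image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

        # Cosine similarity to every label description; pick the closest one.
        sims = image_emb @ text_emb.t()
        pred_idx = sims.argmax(dim=-1).cpu()

        correct += (pred_idx == label_idx).sum().item()   # full-label match
        total += label_idx.size(0)

print(f"Full-label match accuracy: {correct / total:.2%}")
```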
5. Implementation Details

a. Training Pipeline (CNN + LSTM/GRU):

EfficientNet-B4 (CNN Backbone)


Pretrained on ImageNet for extracting meaningful image features.
The last classification layer (self.efficientnet.classifier) is removed (nn.Identity()) to output raw feature vectors.
EfficientNet-B4 outputs a feature vector of size 1792.
GRU (Recurrent Network)
Takes 1792-dimensional image features as input.
Processes them sequentially (useful for sequential tasks like video analysis).
Uses hidden size of 512 and 2 layers.
Fully Connected Layer (FC)
Maps the final GRU output to num_labels (number of classes).
self.fc = nn.Linear(hidden_size, num_labels)
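A minimal sketch of this architecture using torchvision's EfficientNet-B4; the class and variable names are illustrative, and the team's own implementation may differ in detail.

```python
import torch
import torch.nn as nn
from torchvision import models

class EfficientNetGRUTagger(nn.Module):
    def __init__(self, num_labels: int, hidden_size: int = 512, num_layers: int = 2):
        super().__init__()
        # Pretrained EfficientNet-B4 backbone; drop its classifier head so it
        # returns the raw 1792-dimensional feature vector.
        self.efficientnet = models.efficientnet_b4(weights="IMAGENET1K_V1")
        self.efficientnet.classifier = nn.Identity()

        # GRU over the image features (treated as a length-1 sequence here).
        self.gru = nn.GRU(input_size=1792, hidden_size=hidden_size,
                          num_layers=num_layers, batch_first=True)

        # Final projection to one logit per attribute label.
        self.fc = nn.Linear(hidden_size, num_labels)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.efficientnet(images)          # (batch, 1792)
        out, _ = self.gru(feats.unsqueeze(1))      # (batch, 1, hidden_size)
        return self.fc(out[:, -1, :])              # (batch, num_labels) logits

model = EfficientNetGRUTagger(num_labels=50)       # 50 is a placeholder label count
logits = model(torch.randn(2, 3, 380, 380))        # 380x380 is B4's default input size
```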
5. Implementation Details
a. Training Pipeline (CNN + LSTM/GRU):

1. Load Data (Train & Validation)


a. Splits dataset into training and validation sets.
b. Applies augmentations (e.g., resizing, normalization).
c. Converts images to tensors for PyTorch.
d. Loads data in batches for efficient processing.
2. Initialize Model (EfficientNet + GRU)
a. Uses EfficientNet-B4 as a feature extractor (removes final
classification layer).
b. Extracted features are fed into a GRU (Gated Recurrent
Unit) for sequence modeling.
c. A fully connected (FC) layer is added for final predictions.
3. Set Loss & Optimizer
a. Loss Function: BCEWithLogitsLoss(), used for multi-label binary classification.
b. Combines sigmoid activation with binary cross-entropy.
c. Optimizer: Adam (lr=1e-4), an adaptive learning-rate method for efficient convergence.
5. Implementation Details
a. Training Pipeline (CNN + LSTM/GRU):

4. Training Loop (Forward, Backward, Optimize): Train the model by performing forward and backward passes (see the sketch after this list).
Process:
a. Forward Pass:
i. Passes images through EfficientNet to extract features.
ii. GRU processes extracted features and outputs
predictions.
b. Compute Loss:
i. Compares predictions to actual labels using
BCEWithLogitsLoss().
c. Backward Pass & Optimization:
i. Computes gradients (loss.backward()).
ii. Updates model weights (optimizer.step()).
d. Track Training Loss:
i. Stores loss for later visualization.
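A minimal sketch of this loop, reusing the `EfficientNetGRUTagger` sketch above and assuming `train_loader` yields image tensors with multi-hot label vectors (both are assumptions, not the team's exact code).

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = EfficientNetGRUTagger(num_labels=50).to(device)   # from the earlier sketch
criterion = torch.nn.BCEWithLogitsLoss()                  # sigmoid + BCE in one step
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

train_losses = []
model.train()
for images, labels in train_loader:                       # hypothetical dataloader
    images, labels = images.to(device), labels.float().to(device)

    logits = model(images)              # forward pass: EfficientNet -> GRU -> FC
    loss = criterion(logits, labels)    # compare logits to multi-hot targets

    optimizer.zero_grad()
    loss.backward()                     # compute gradients
    optimizer.step()                    # update weights
    train_losses.append(loss.item())    # track loss for visualization
```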
5. Implementation Details
a. Training Pipeline (CNN + LSTM/GRU):

5. Validation (Evaluate Model): Measure model performance on unseen validation data (see the sketch after this list).
Process:
Disables gradient calculation (torch.no_grad()).
Runs the forward pass on validation images.
Computes validation loss (without updating weights).
Helps detect overfitting and compare model performance.
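A matching validation sketch under torch.no_grad(), reusing `model`, `criterion`, and `device` from the training sketch above; `val_loader` is again a hypothetical dataloader.

```python
import torch

model.eval()                                   # disable dropout / use running BN stats
val_loss, num_batches = 0.0, 0
with torch.no_grad():                          # no gradient tracking during evaluation
    for images, labels in val_loader:          # hypothetical validation dataloader
        images, labels = images.to(device), labels.float().to(device)
        logits = model(images)                 # forward pass only
        val_loss += criterion(logits, labels).item()
        num_batches += 1

print(f"Validation loss: {val_loss / num_batches:.4f}")  # compare against training loss
```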
5. Implementation Details
b. Configuration Explanation (CLIP)

Parameter | Value | Explanation
Backbone Model | CLIP (ViT-B/32) | Pretrained vision-language model extracts meaningful image-text features.
Feature Vector Size | 512 | Output dimension of CLIP's text and image embeddings.
Text Processing | Two-word labels hyphenated | Ensures consistency in multi-word labels.
Batch Size | 32 | Number of samples processed together in training.
Train-Val Split | 80%-20% | Dataset split for training and validation.
Optimizer | AdamW | Adaptive optimization algorithm for model training.
Learning Rate | 5e-5 | Controls the step size during optimization.
Loss Function | Cross-entropy | Measures classification performance between predicted and true labels.
Frozen Layers | All except text_projection & visual_projection | Fine-tunes only the embedding projection layers.
Similarity Metric | Cosine Similarity | Matches predicted embeddings with label embeddings.
Pretrained Weights | True | Uses OpenAI's pretrained weights for CLIP ViT-B/32.
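The frozen-layer setting could look like the following sketch, assuming the HuggingFace transformers implementation of CLIP, where the projection heads are exposed as visual_projection and text_projection; the team's actual codebase is not shown, so treat this as illustrative.

```python
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# Freeze everything except the two projection layers that map the encoders
# into the shared 512-dimensional embedding space.
for name, param in model.named_parameters():
    param.requires_grad = ("visual_projection" in name) or ("text_projection" in name)

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # expected: ['visual_projection.weight', 'text_projection.weight']
```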
5. Implementation Details
b. Configuration Explanation (ResNet50 + LSTM)

Parameter | Value | Explanation
Backbone Model | ResNet50 | Pretrained CNN extracts meaningful image features.
Feature Vector Size | 2048 | Output dimension of ResNet50 before passing to the LSTM.
Text Processing | LSTM | Captures temporal dependencies in feature sequences.
Hidden Size | 512 | Number of hidden units in the LSTM layers.
Batch-First | True | Ensures input tensors are batch-major for easier processing.
Optimizer | Adam | Adaptive optimization algorithm for model training.
Learning Rate | 1e-4 | Controls the step size during optimization.
Loss Function | BCEWithLogitsLoss | Measures classification performance between predicted and true labels.
Frozen Layers | None | Trains all layers.
Similarity Metric | Cosine Similarity | Matches predicted embeddings with label embeddings.
Pretrained Weights | True | Uses ResNet50's pretrained weights.


5. Implementation Details
b. Configuration Explanation (EfficientNetB4 + GRU)

Parameter | Value | Explanation
Backbone Model | EfficientNet | Pretrained CNN extracts meaningful image features.
Feature Vector Size | 1792 | Output dimension of EfficientNetB4 before passing to the GRU.
Text Processing | GRU (Gated Recurrent Unit) | Captures temporal dependencies in feature sequences.
Hidden Size | 512 | Number of hidden units in the GRU layers.
Batch-First | True | Ensures input tensors are batch-major for easier processing.
Optimizer | Adam | Adaptive optimization algorithm for model training.
Learning Rate | 1e-4 | Controls the step size during optimization.
Loss Function | BCEWithLogitsLoss | Measures classification performance between predicted and true labels.
Frozen Layers | None | Trains all layers.
Similarity Metric | Cosine Similarity | Matches predicted embeddings with label embeddings.
Pretrained Weights | True | Uses EfficientNetB4's pretrained weights.


6. Results
a. Evaluation Metrics
Currently using:
1. Classification Metrics
Accuracy: Measures overall correctness.
Precision, Recall, F1-Score: Important if the dataset is imbalanced.
2. Human Evaluation
Given the subjective nature of fashion descriptions, some works also include human evaluation studies to assess the correctness, completeness, and aesthetic quality of generated captions.

Proposed metrics:
1. BLEU Scores
In the fashion domain, high BLEU scores indicate that the generated captions capture a large portion of the key details from reference descriptions.
ROC-AUC Score: Useful for evaluating ranking quality.
2. Ranking Metrics
Mean Average Precision (MAP): Measures how well the model ranks relevant items.
Normalized Discounted Cumulative Gain (NDCG): Evaluates ranking quality with position importance.
Hit Rate: Measures how often the correct item is in the recommended list.
3. Recommendation-Specific Metrics
Coverage: Percentage of catalog items recommended.
Diversity: Measures how varied the recommendations are.
Serendipity: Evaluates unexpected yet relevant recommendations.
4. Embedding-Based Metrics
Cosine Similarity: Measures similarity between recommended and ground truth items.
Mean Squared Error (MSE): If using embeddings to learn similarity scores.
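Since the results that follow report full-label match accuracy and per-attribute accuracy, here is a small sketch of how those two numbers can be computed from predicted and ground-truth attribute dictionaries; the attribute names and data layout are illustrative.

```python
from typing import Dict, List

def full_label_accuracy(preds: List[Dict[str, str]], truths: List[Dict[str, str]]) -> float:
    """Fraction of samples where every attribute matches the ground truth."""
    exact = sum(p == t for p, t in zip(preds, truths))
    return exact / len(truths)

def per_attribute_accuracy(preds: List[Dict[str, str]], truths: List[Dict[str, str]]) -> Dict[str, float]:
    """Accuracy computed independently for each attribute (looks, fit, colors, ...)."""
    keys = truths[0].keys()
    return {k: sum(p[k] == t[k] for p, t in zip(preds, truths)) / len(truths) for k in keys}

# Toy example with two attributes.
truths = [{"item": "T-shirts", "colors": "Red"}, {"item": "Shirts", "colors": "Lavender"}]
preds  = [{"item": "T-shirts", "colors": "Red"}, {"item": "T-shirts", "colors": "Lavender"}]
print(full_label_accuracy(preds, truths))     # 0.5 - only the first sample matches fully
print(per_attribute_accuracy(preds, truths))  # {'item': 0.5, 'colors': 1.0}
```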
6. Results
b. Results Insight (CLIP)
1. Overall Trend
Training Loss decreases steadily from roughly 0.18 to 0.14.
Validation Loss moves from roughly 0.15 to 0.18, with some fluctuations after epoch 6.
The model shows signs of overfitting.
2. Key Observations
The loss curves indicate substantial overfitting.
Attribute-wise accuracy reveals inconsistencies in performance, indicating
imbalanced learning.
Some attributes (e.g., sleeveLength, looks, prints) achieve high accuracy,
while others (e.g., details) are severely underperforming.
3. Performance Analysis
Full Label Match Accuracy = 21.02% – This low score indicates that the
model rarely predicts all labels in a combination correctly.
The model seems to over-focus on dominant or visually obvious attributes
like Item Type, sleeveLength, and colors.
Fine-grained attributes like details (40.61%) and textures (77.59%) are
neglected.
This suggests that multi-label learning is suboptimal, possibly due to:
Label imbalance
Insufficient attribute-specific attention
Bias from dominant attribute classes (e.g., “T-shirts” in item)
6. Results
b. Results Insight (Resnet50+LSTM)
1. Overall Trend
Training Loss consistently decreases from approximately
0.19 to 0.08 over 10 epochs.
Validation Loss starts at approximately 0.12 and decreases
to about 0.07.
The model appears to be learning progressively as both
losses decrease steadily.
2. Key Observations
There is a larger initial drop in training loss between epochs
1-2 compared to later epochs.
Validation loss is consistently lower than training loss
throughout the training process.
No divergence between training and validation loss curves is
visible.
3. Performance Analysis
Full Label Accuracy = 0% – The model completely fails at
predicting the full combination of labels
6. Results
b. Results Insight (EfficientNetB4+GRU)
1. Overall Trend
Training Loss consistently decreases from 0.1936 → 0.0806.
Validation Loss also improves from 0.1252 → 0.0767.
The model gradually learns and generalizes well, as both losses decrease
together.
2. Key Observations
No signs of overfitting in loss curves, but the per-label accuracy tells a
different story.
Stable Training
3. Performance Analysis
Full Label Accuracy = 0% – The model never correctly predicts the full
combination of labels, indicating:
It focuses too much on dominant attributes (like Item Type) while ignoring
finer details.
Multi-label learning is not well-optimized, suggesting a need for better
attribute balancing.
The Item attribute may also be biased, as T-shirts is the dominant class for that attribute.
6. Results
c. Visualization
Ground Truth: Casual Cotton Embroidery lettering Long-Sleeve normal normal-fit Red T-shirts
[Figure: CLIP, ResNet50 + LSTM, and EfficientNetB4 + GRU predictions for this example]


6. Results
c. Visualization
Ground Truth: UNK Cotton Pockets Solid Sleeveless normal normal-fit Lavender Shirts
[Figure: CLIP, ResNet50 + LSTM, and EfficientNetB4 + GRU predictions for this example]


6. Results
c. Visualization
Ground Truth: Casual Cotton Drop-shoulder lettering Short-Sleeve normal normal-fit Black T-shirts
[Figure: CLIP, ResNet50 + LSTM, and EfficientNetB4 + GRU predictions for this example]


7. Discussion
Challenges and Future Directions:
Fine-Grained Details:
Clothing items have many subtle visual details. Models must
capture these details—such as texture, pattern, and fit—while
also being robust to variations in pose, lighting, and background.
Dataset Diversity:
Many datasets are biased toward Western fashion, so there’s a
growing need for diverse datasets (e.g., ArabicFashionData) that
cover different cultural styles and languages.
Dynamic and Interactive Systems:
Beyond static image captioning, there is increasing interest in
video captioning and interactive systems (e.g., virtual try-on)
where the model must generate descriptions in real time.
Multimodal Fusion:
Future work is likely to continue exploring how best to combine
visual features with textual or attribute information, potentially
through more sophisticated transformer-based architectures or
retrieval-augmented methods.
7. Discussion
Challenges and Future Directions:

Clothing image captioning has evolved from generic image captioning approaches to sophisticated, fashion-oriented models that integrate visual, semantic, and attribute information.
Early models based on CNN-RNN architectures paved the way
for attention mechanisms, while recent research leverages
transformer-based methods and external memory to achieve
more accurate and expressive captions.
The field continues to progress as researchers develop new
datasets and refine models to handle the complexities of fashion
data, addressing challenges such as dataset diversity, fine-
grained detail recognition, and cross-cultural representation.
THANK YOU FOR
YOUR ATTENTION !
