
CLOTHING TAG SUGGESTION &

TAG-BASED SEARCH SYSTEM


Subject: DBM302m
Supervisor: HaiNH51
Authors: Duy Hoang, Nguyen Tran, Duy Nguyen, Thanh Nguyen
Department of Information Science, FPT University, Ho Chi Minh City, Vietnam
Outline
1. Introduction
2. Related Work
a. Encoder (CNN) & Decoder (RNN/LSTM)
b. Spatial Attention & Dynamic Focus
3. Motivations & Solutions
a. Motivations
b. Our Solutions
4. Model Details
a. Summary & Explanation of Models
b. Proposed Model
5. Implementation Details
a. Training Pipeline
b. Configuration Explanation
6. Results
a. Evaluation Metrics
b. Results Insight
c. Visualization
7. Discussion
1. Introduction
Summary of our problem
We want to build a model that generates tags (attributes) from a clothing image.
Expected Input & Output:

Input (Image) → Output (Text)

Our goals:
Approach the problem in different ways: image captioning, multi-label classification, etc.
Create a minimal runnable model that can generate attributes.
Study and research current methods in depth.
2. Related Work
a. Encoder (CNN) & Decoder (RNN/LSTM):
Show and Tell was one of the first end-to-end neural network models for image captioning. Its main components are an encoder and a decoder.

Key Advantages:
It establishes a straightforward mapping from visual data to language.
The architecture is trained end-to-end, which simplifies the training process.
Limitations:
Because it relies on a single, global feature vector, it can sometimes miss out on important local details
in an image—details that might be crucial for describing nuanced aspects of clothing (like specific
patterns or textures).

Ref: Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and Tell: A Neural Image Caption Generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3156–3164.
2. Related Work
b. Spatial Attention & Dynamic Focus:
Show, Attend and Tell addressed the limitation of global representations by introducing an attention
mechanism:
Spatial Attention: attends over spatial feature maps from an intermediate CNN layer.
Dynamic Focus: dynamically generates a context vector for each word by taking a weighted sum of the
CNN feature map based on the computed attention weights.

Key Advantages:
It improves captioning performance by allowing the model to capture local details.
It provides interpretable attention maps that can sometimes highlight the regions corresponding to
particular words in the caption.

Ref: Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., & Bengio, Y. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proceedings of the International Conference on Machine Learning (ICML), pp. 2048–2057.
3. Motivations & Solutions
a. Motivations:
This problem can be solved in multiple ways:
Multi-label: Predict each attribute independently.
Image Captioning: Generate a natural-language caption for the image.
CLIP: Strong at vision-language understanding; learns image-text alignment by training on vast amounts of images paired with textual descriptions.
ResNet50+LSTM or EfficientNet+GRU: Combines a CNN for image feature extraction with an LSTM/GRU for sequential attribute prediction, potentially capturing dependencies between attributes.

Model | Image Feature Extractor | Text Processing | Key Strengths
CLIP | Vision Transformer (ViT) | Transformer | Strong image-text alignment
ResNet50+LSTM | ResNet50 | LSTM | Captures sequential dependencies in labels
EfficientNet+GRU | EfficientNet | GRU | More lightweight, better efficiency
3. Motivations & Solutions
b. Our solutions

CLIP
ResNet50 + LSTM
EfficientNet + GRU
4. Model Details
a. Summary & Explanation of Models (CLIP)
CLIP: overview
4. Model Details
a. Summary & Explanation of Models (CLIP)
CLIP: Encoder (ViT) - details

[Figure: Image encoder (ViT) and Text encoder (Transformer)]


4. Model Details
a. Summary & Explanation of Models (CLIP)
CLIP: Loss - Contrastive loss (InfoNCE)

In practice, this contrastive loss is not computed directly; it is implemented as a cross-entropy loss over the image-text similarity logits.

Insight:
Contrastive loss encourages a model to learn representations by drawing similar samples (such as an image and its augmented version) closer together while pushing dissimilar samples apart.
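To make the "cross-entropy over similarity logits" point concrete, here is a minimal sketch of a symmetric CLIP-style InfoNCE loss; the encoders are assumed to already produce batch-aligned embeddings, and the names `image_emb`/`text_emb` and the fixed temperature are illustrative (CLIP itself learns a temperature).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss for a batch of matching image/text pairs.

    image_emb, text_emb: (batch, dim) embeddings; row i of each is a matching pair.
    """
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature

    # The matching text for image i sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```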
4. Model Details
a. Summary & Explanation of Models (Encoder-Decoder)
Encoder-Decoder: ResNet50 + LSTM

[Figure: ResNet50 architecture and LSTM architecture]

4. Model Details
a. Summary & Explanation of Models (Encoder-Decoder)
Encoder-Decoder: EfficientNetB4 + GRU

[Figure: EfficientNetB4 architecture and GRU architecture]

4. Model Details
a. Summary & Explanation of Models (Encoder-Decoder)
Encoder-Decoder: EfficientNetB4 + GRU

Model | Depth Scale | Width Scale | Parameters (M)
B1 | 1.1× | 1.0× | 7.8M
B2 | 1.2× | 1.1× | 9.2M
B3 | 1.4× | 1.2× | 12M
B4 | 1.8× | 1.4× | 19M

Comparison of different versions of EfficientNet


4. Model Details
a. Summary & Explanation of Models (Encoder-Decoder)
Encoder-Decoder: BCEWithLogitsLoss

Insights:
BCEWithLogitsLoss combines the sigmoid activation function with binary cross-entropy (BCE) loss in a single step, improving numerical stability and training efficiency.
It avoids issues like vanishing or exploding gradients by handling extreme logit values better than applying sigmoid and BCE separately.
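A minimal PyTorch sketch of the same point: the fused call takes raw logits, so the sigmoid never has to be applied explicitly (the tensor shapes and random values are illustrative).

```python
import torch
import torch.nn as nn

# Raw model outputs (logits) for a batch of 4 images and 6 binary attributes.
logits = torch.randn(4, 6)
targets = torch.randint(0, 2, (4, 6)).float()  # multi-hot ground-truth labels

# Fused sigmoid + BCE: numerically stable via the log-sum-exp trick.
criterion = nn.BCEWithLogitsLoss()
loss = criterion(logits, targets)

# Equivalent but less stable two-step version (sigmoid first, then BCE).
unstable = nn.functional.binary_cross_entropy(torch.sigmoid(logits), targets)
print(loss.item(), unstable.item())  # values agree up to floating-point error
```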
4. Model Details
b. Proposed Model
5. Implementation Details
a. Training Pipeline (CLIP):

CLIP (Contrastive Language-Image Pretraining)


Pretrained on a massive dataset of image-text pairs to learn a shared representation
between images and text.
Vision Transformer (ViT-B/32) as the image encoder, producing a feature vector of size 512.
Text Transformer as the text encoder, outputting a 512-dimensional embedding for textual
descriptions.
Both image and text embeddings are normalized and projected into a shared latent space for
similarity comparison.
5. Implementation Details
a. Training Pipeline (CLIP):

1. Load Data (Train & Validation)
Splits the dataset into training and validation sets.
Applies the necessary transformations (e.g., resizing, normalization) based on CLIP's preprocessing pipeline.
Converts images and text into tokenized inputs for CLIP.
Loads data in batches for efficient processing.
2. Initialize Model (CLIP)
Uses a pre-trained CLIP model (ViT-B/32) to extract joint image-text representations.
Images are encoded using a Vision Transformer (ViT), while text is processed via a Transformer-based language model.
Computes similarity scores between image and text embeddings for classification or retrieval tasks.
3. Set Loss & Optimizer
Loss Function: Contrastive loss (InfoNCE) to maximize similarity between matching image-text pairs while minimizing mismatches.
Optimizer: AdamW (lr=1e-5) for stable and efficient fine-tuning of CLIP's parameters.
5. Implementation Details
a. Training Pipeline (CLIP):

4. Training Loop (Forward, Backward, Optimize): Train the model by performing forward and backward passes (a minimal sketch follows this list).
Process:
1. Forward Pass:
Pass images through CLIP's vision encoder to extract embeddings.
Pass text through CLIP's text encoder to extract embeddings.
Compute cosine similarity between image and text embeddings.
2. Compute Loss:
Applies contrastive loss (InfoNCE) to maximize correct pair similarity.
3. Backward Pass & Optimization:
Computes gradients (loss.backward()).
Updates model weights (optimizer.step()).
4. Track Training Loss:
Stores loss for later visualization and monitoring.
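A minimal sketch of this loop, assuming OpenAI's `clip` package and a dataloader that yields preprocessed image tensors with their tag strings; the dataloader (`train_loader`) and the tag format are assumptions, not the team's exact code.

```python
import torch
import clip  # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # pretrained ViT-B/32
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
ce_image = torch.nn.CrossEntropyLoss()
ce_text = torch.nn.CrossEntropyLoss()

train_losses = []
for images, captions in train_loader:            # hypothetical dataloader
    images = images.to(device)
    tokens = clip.tokenize(captions).to(device)  # tag strings -> token ids

    # Forward pass: CLIP returns scaled image-text similarity logits.
    logits_per_image, logits_per_text = model(images, tokens)
    targets = torch.arange(images.size(0), device=device)

    # Symmetric InfoNCE implemented as cross-entropy over the logits.
    loss = (ce_image(logits_per_image, targets) +
            ce_text(logits_per_text, targets)) / 2

    optimizer.zero_grad()
    loss.backward()        # compute gradients
    optimizer.step()       # update weights
    train_losses.append(loss.item())  # track loss for visualization
```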
5. Implementation Details
a. Training Pipeline (CLIP):

5. Validation (Evaluate Model): Measure model performance on unseen validation data (a sketch of this retrieval-style evaluation follows this list).
Extract text embeddings for all unique label descriptions.
Normalize text embeddings for similarity comparison.
Process test images to extract image embeddings.
Match each image embedding to the closest text embedding via cosine similarity.
Compute:
Overall accuracy (full-label match).
Per-attribute accuracy (individual components like looks, fit, colors, etc.).
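A sketch of this retrieval-style validation under the same assumptions as above; `label_texts` (the list of unique label descriptions) and `val_loader` (yielding images with the index of their true description) are illustrative placeholders.

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model.eval()

with torch.no_grad():
    # Encode every unique label description once and normalize.
    text_tokens = clip.tokenize(label_texts).to(device)   # hypothetical list of strings
    text_emb = model.encode_text(text_tokens)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    correct, total = 0, 0
    for images, label_idx in val_loader:                  # hypothetical dataloader
        image_emb = model.encode_image(images.to(device))
        image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

        # Cosine similarity to every label description; pick the closest one.
        sims = image_emb @ text_emb.t()
        pred_idx = sims.argmax(dim=-1).cpu()

        correct += (pred_idx == label_idx).sum().item()   # full-label match
        total += label_idx.size(0)

print(f"Full-label match accuracy: {correct / total:.2%}")
```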
5. Implementation Details

a. Training Pipeline (CNN + LSTM/GRU):

EfficientNet-B4 (CNN Backbone)


Pretrained on ImageNet for extracting meaningful image features.
The last classification layer (self.efficientnet.classifier) is removed (nn.Identity()) to output raw feature vectors.
EfficientNet-B4 outputs a feature vector of size 1792.
GRU (Recurrent Network)
Takes 1792-dimensional image features as input.
Processes them sequentially (useful for sequential tasks like video analysis).
Uses hidden size of 512 and 2 layers.
Fully Connected Layer (FC)
Maps the final GRU output to num_labels (number of classes).
self.fc = nn.Linear(hidden_size, num_labels)
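A minimal sketch of this architecture using torchvision's EfficientNet-B4; the class and variable names are illustrative, and the team's own implementation may differ in detail.

```python
import torch
import torch.nn as nn
from torchvision import models

class EfficientNetGRUTagger(nn.Module):
    def __init__(self, num_labels: int, hidden_size: int = 512, num_layers: int = 2):
        super().__init__()
        # Pretrained EfficientNet-B4 backbone; drop its classifier head so it
        # returns the raw 1792-dimensional feature vector.
        self.efficientnet = models.efficientnet_b4(weights="IMAGENET1K_V1")
        self.efficientnet.classifier = nn.Identity()

        # GRU over the image features (treated as a length-1 sequence here).
        self.gru = nn.GRU(input_size=1792, hidden_size=hidden_size,
                          num_layers=num_layers, batch_first=True)

        # Final projection to one logit per attribute label.
        self.fc = nn.Linear(hidden_size, num_labels)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.efficientnet(images)          # (batch, 1792)
        out, _ = self.gru(feats.unsqueeze(1))      # (batch, 1, hidden_size)
        return self.fc(out[:, -1, :])              # (batch, num_labels) logits

model = EfficientNetGRUTagger(num_labels=50)       # 50 is a placeholder label count
logits = model(torch.randn(2, 3, 380, 380))        # 380x380 is B4's default input size
```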
5. Implementation Details
a. Training Pipeline (CNN + LSTM/GRU):

1. Load Data (Train & Validation)


a. Splits dataset into training and validation sets.
b. Applies augmentations (e.g., resizing, normalization).
c. Converts images to tensors for PyTorch.
d. Loads data in batches for efficient processing.
2. Initialize Model (EfficientNet + GRU)
a. Uses EfficientNet-B4 as a feature extractor (removes final
classification layer).
b. Extracted features are fed into a GRU (Gated Recurrent
Unit) for sequence modeling.
c. A fully connected (FC) layer is added for final predictions.
3. Set Loss & Optimizer
a. Loss Function: BCEWithLogitsLoss(), used for multi-label binary classification.
b. Combines sigmoid activation with binary cross-entropy.
c. Optimizer: Adam (lr=1e-4), an adaptive learning-rate method for efficient convergence.
5. Implementation Details
a. Training Pipeline (CNN + LSTM/GRU):

4. Training Loop (Forward, Backward, Optimize): Train the model by performing forward and backward passes (see the sketch after this list).
Process:
a. Forward Pass:
i. Passes images through EfficientNet to extract features.
ii. GRU processes extracted features and outputs
predictions.
b. Compute Loss:
i. Compares predictions to actual labels using
BCEWithLogitsLoss().
c. Backward Pass & Optimization:
i. Computes gradients (loss.backward()).
ii. Updates model weights (optimizer.step()).
d. Track Training Loss:
i. Stores loss for later visualization.
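A minimal sketch of this loop, reusing the `EfficientNetGRUTagger` sketch above and assuming `train_loader` yields image tensors with multi-hot label vectors (both are assumptions, not the team's exact code).

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = EfficientNetGRUTagger(num_labels=50).to(device)   # from the earlier sketch
criterion = torch.nn.BCEWithLogitsLoss()                  # sigmoid + BCE in one step
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

train_losses = []
model.train()
for images, labels in train_loader:                       # hypothetical dataloader
    images, labels = images.to(device), labels.float().to(device)

    logits = model(images)              # forward pass: EfficientNet -> GRU -> FC
    loss = criterion(logits, labels)    # compare logits to multi-hot targets

    optimizer.zero_grad()
    loss.backward()                     # compute gradients
    optimizer.step()                    # update weights
    train_losses.append(loss.item())    # track loss for visualization
```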
5. Implementation Details
a. Training Pipeline (CNN + LSTM/GRU):

5. Validation (Evaluate Model): Measure model performance on unseen validation data (see the sketch after this list).
Process:
Disables gradient calculation (torch.no_grad()).
Runs the forward pass on validation images.
Computes validation loss (without updating weights).
Helps detect overfitting and compare model performance.
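A matching validation sketch under torch.no_grad(), reusing `model`, `criterion`, and `device` from the training sketch above; `val_loader` is again a hypothetical dataloader.

```python
import torch

model.eval()                                   # disable dropout / use running BN stats
val_loss, num_batches = 0.0, 0
with torch.no_grad():                          # no gradient tracking during evaluation
    for images, labels in val_loader:          # hypothetical validation dataloader
        images, labels = images.to(device), labels.float().to(device)
        logits = model(images)                 # forward pass only
        val_loss += criterion(logits, labels).item()
        num_batches += 1

print(f"Validation loss: {val_loss / num_batches:.4f}")  # compare against training loss
```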
5. Implementation Details
b. Configuration Explanation (CLIP)

Parameter | Value | Explanation
Backbone Model | CLIP (ViT-B/32) | Pretrained vision-language model extracts meaningful image-text features.
Feature Vector Size | 512 | Output dimension of CLIP's text and image embeddings.
Text Processing | Two-word labels hyphenated | Ensures consistency in multi-word labels.
Batch Size | 32 | Number of samples processed together in training.
Train-Val Split | 80%-20% | Dataset split for training and validation.
Optimizer | AdamW | Adaptive optimization algorithm for model training.
Learning Rate | 5e-5 | Controls the step size during optimization.
Loss Function | Cross-entropy | Measures classification performance between predicted and true labels.
Frozen Layers | All except text_projection & visual_projection | Fine-tunes only the embedding projection layers.
Similarity Metric | Cosine Similarity | Matches predicted embeddings with label embeddings.
Pretrained Weights | True | Uses OpenAI's pretrained weights for CLIP ViT-B/32.
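The frozen-layer setting could look like the following sketch, assuming the HuggingFace transformers implementation of CLIP, where the projection heads are exposed as visual_projection and text_projection; the team's actual codebase is not shown, so treat this as illustrative.

```python
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# Freeze everything except the two projection layers that map the encoders
# into the shared 512-dimensional embedding space.
for name, param in model.named_parameters():
    param.requires_grad = ("visual_projection" in name) or ("text_projection" in name)

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # expected: ['visual_projection.weight', 'text_projection.weight']
```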
5. Implementation Details
b. Configuration Explanation (ResNet50 + LSTM)

Parameter | Value | Explanation
Backbone Model | ResNet50 | Pretrained CNN extracts meaningful image features.
Feature Vector Size | 2048 | Output dimension of ResNet50 before passing to the LSTM.
Text Processing | LSTM | Captures temporal dependencies in feature sequences.
Hidden Size | 512 | Number of hidden units in the LSTM layers.
Batch-First | True | Ensures input tensors are batch-major for easier processing.
Optimizer | Adam | Adaptive optimization algorithm for model training.
Learning Rate | 1e-4 | Controls the step size during optimization.
Loss Function | BCEWithLogitsLoss | Measures classification performance between predicted and true labels.
Frozen Layers | None | Trains all layers.
Similarity Metric | Cosine Similarity | Matches predicted embeddings with label embeddings.
Pretrained Weights | True | Uses ResNet50's pretrained weights.


5. Implementation Details
b. Configuration Explanation (EfficientNetB4 + GRU)

Parameter | Value | Explanation
Backbone Model | EfficientNet | Pretrained CNN extracts meaningful image features.
Feature Vector Size | 1792 | Output dimension of EfficientNetB4 before passing to the GRU.
Text Processing | GRU (Gated Recurrent Unit) | Captures temporal dependencies in feature sequences.
Hidden Size | 512 | Number of hidden units in the GRU layers.
Batch-First | True | Ensures input tensors are batch-major for easier processing.
Optimizer | Adam | Adaptive optimization algorithm for model training.
Learning Rate | 1e-4 | Controls the step size during optimization.
Loss Function | BCEWithLogitsLoss | Measures classification performance between predicted and true labels.
Frozen Layers | None | Trains all layers.
Similarity Metric | Cosine Similarity | Matches predicted embeddings with label embeddings.
Pretrained Weights | True | Uses EfficientNetB4's pretrained weights.


6. Results
a. Evaluation Metrics
Currently using:
1. Classification Metrics
Accuracy: Measures overall correctness.
Precision, Recall, F1-Score: Important if the dataset is imbalanced.
2. Human Evaluation
Given the subjective nature of fashion descriptions, some works also include human evaluation studies to assess the correctness, completeness, and aesthetic quality of generated captions.

Proposed metrics:
1. BLEU Scores
In the fashion domain, high BLEU scores indicate that the generated captions capture a large portion of the key details from reference descriptions.
ROC-AUC Score: Useful for evaluating ranking quality.
2. Ranking Metrics
Mean Average Precision (MAP): Measures how well the model ranks relevant items.
Normalized Discounted Cumulative Gain (NDCG): Evaluates ranking quality with position importance.
Hit Rate: Measures how often the correct item is in the recommended list.
3. Recommendation-Specific Metrics
Coverage: Percentage of catalog items recommended.
Diversity: Measures how varied the recommendations are.
Serendipity: Evaluates unexpected yet relevant recommendations.
4. Embedding-Based Metrics
Cosine Similarity: Measures similarity between recommended and ground truth items.
Mean Squared Error (MSE): If using embeddings to learn similarity scores.
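Since the results that follow report full-label match accuracy and per-attribute accuracy, here is a small sketch of how those two numbers can be computed from predicted and ground-truth attribute dictionaries; the attribute names and data layout are illustrative.

```python
from typing import Dict, List

def full_label_accuracy(preds: List[Dict[str, str]], truths: List[Dict[str, str]]) -> float:
    """Fraction of samples where every attribute matches the ground truth."""
    exact = sum(p == t for p, t in zip(preds, truths))
    return exact / len(truths)

def per_attribute_accuracy(preds: List[Dict[str, str]], truths: List[Dict[str, str]]) -> Dict[str, float]:
    """Accuracy computed independently for each attribute (looks, fit, colors, ...)."""
    keys = truths[0].keys()
    return {k: sum(p[k] == t[k] for p, t in zip(preds, truths)) / len(truths) for k in keys}

# Toy example with two attributes.
truths = [{"item": "T-shirts", "colors": "Red"}, {"item": "Shirts", "colors": "Lavender"}]
preds  = [{"item": "T-shirts", "colors": "Red"}, {"item": "T-shirts", "colors": "Lavender"}]
print(full_label_accuracy(preds, truths))     # 0.5 - only the first sample matches fully
print(per_attribute_accuracy(preds, truths))  # {'item': 0.5, 'colors': 1.0}
```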
6. Results
b. Results Insight (CLIP)
1. Overall Trend
Training Loss decreases steadily from roughly 0.18 to 0.14.
Validation Loss moves from roughly 0.15 to 0.18, with some fluctuations after epoch 6.
The model shows signs of overfitting.
2. Key Observations
The loss curves indicate substantial overfitting.
Attribute-wise accuracy reveals inconsistencies in performance, indicating
imbalanced learning.
Some attributes (e.g., sleeveLength, looks, prints) achieve high accuracy,
while others (e.g., details) are severely underperforming.
3. Performance Analysis
Full Label Match Accuracy = 21.02% – This low score indicates that the
model rarely predicts all labels in a combination correctly.
The model seems to over-focus on dominant or visually obvious attributes
like Item Type, sleeveLength, and colors.
Fine-grained attributes like details (40.61%) and textures (77.59%) are
neglected.
This suggests that multi-label learning is suboptimal, possibly due to:
Label imbalance
Insufficient attribute-specific attention
Bias from dominant attribute classes (e.g., “T-shirts” in item)
6. Results
b. Results Insight (Resnet50+LSTM)
1. Overall Trend
Training Loss consistently decreases from approximately
0.19 to 0.08 over 10 epochs.
Validation Loss starts at approximately 0.12 and decreases
to about 0.07.
The model appears to be learning progressively as both
losses decrease steadily.
2. Key Observations
There is a larger initial drop in training loss between epochs
1-2 compared to later epochs.
Validation loss is consistently lower than training loss
throughout the training process.
No divergence between training and validation loss curves is
visible.
3. Performance Analysis
Full Label Accuracy = 0% – The model completely fails at
predicting the full combination of labels
6. Results
b. Results Insight (EfficientNetB4+GRU)
1. Overall Trend
Training Loss consistently decreases from 0.1936 → 0.0806.
Validation Loss also improves from 0.1252 → 0.0767.
The model gradually learns and generalizes well, as both losses decrease
together.
2. Key Observations
No signs of overfitting in loss curves, but the per-label accuracy tells a
different story.
Stable Training
3. Performance Analysis
Full Label Accuracy = 0% – The model never correctly predicts the full
combination of labels, indicating:
It focuses too much on dominant attributes (like Item Type) while ignoring
finer details.
Multi-label learning is not well-optimized, suggesting a need for better
attribute balancing.
The Item attribute may also be biased, as T-shirts is the dominant class for that attribute.
6. Results
c. Visualization
Ground Truth: Casual Cotton Embroidery lettering Long-Sleeve normal normal-fit Red T-shirts
[Figure: CLIP, ResNet50 + LSTM, and EfficientNetB4 + GRU predictions for this example]


6. Results
c. Visualization
Ground Truth: UNK Cotton Pockets Solid Sleeveless normal normal-fit Lavender Shirts
[Figure: CLIP, ResNet50 + LSTM, and EfficientNetB4 + GRU predictions for this example]


6. Results
c. Visualization
Ground Truth: Casual Cotton Drop-shoulder lettering Short-Sleeve normal normal-fit Black T-shirts
[Figure: CLIP, ResNet50 + LSTM, and EfficientNetB4 + GRU predictions for this example]


7. Discussion
Challenges and Future Directions:
Fine-Grained Details:
Clothing items have many subtle visual details. Models must
capture these details—such as texture, pattern, and fit—while
also being robust to variations in pose, lighting, and background.
Dataset Diversity:
Many datasets are biased toward Western fashion, so there’s a
growing need for diverse datasets (e.g., ArabicFashionData) that
cover different cultural styles and languages.
Dynamic and Interactive Systems:
Beyond static image captioning, there is increasing interest in
video captioning and interactive systems (e.g., virtual try-on)
where the model must generate descriptions in real time.
Multimodal Fusion:
Future work is likely to continue exploring how best to combine
visual features with textual or attribute information, potentially
through more sophisticated transformer-based architectures or
retrieval-augmented methods.
7. Discussion
Challenges and Future Directions:

Clothing image captioning has evolved from generic image captioning approaches to sophisticated, fashion-oriented models that integrate visual, semantic, and attribute information.
Early models based on CNN-RNN architectures paved the way
for attention mechanisms, while recent research leverages
transformer-based methods and external memory to achieve
more accurate and expressive captions.
The field continues to progress as researchers develop new
datasets and refine models to handle the complexities of fashion
data, addressing challenges such as dataset diversity, fine-
grained detail recognition, and cross-cultural representation.
THANK YOU FOR
YOUR ATTENTION !
