Cloth Captioning
Key Advantages:
It establishes a straightforward mapping from visual data to language.
The architecture is trained end-to-end, which simplifies the training process.
Limitations:
Because it relies on a single global feature vector, it can miss important local details in an
image, such as the specific patterns or textures that are crucial for describing nuanced aspects
of clothing.
Ref: Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and Tell: A Neural Image Caption Generator.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3156–3164.
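A minimal sketch of this global-feature pattern, assuming a ResNet50 backbone in place of the paper's GoogLeNet; dimensions and vocabulary size are illustrative:

```python
import torch
import torch.nn as nn
from torchvision import models

# Sketch of the Show-and-Tell pattern: one global CNN feature seeds an LSTM decoder.
# ResNet50 stands in for the paper's GoogLeNet; all sizes are illustrative.
class GlobalCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=512, hidden_dim=512):
        super().__init__()
        cnn = models.resnet50(weights="IMAGENET1K_V1")
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # global pooled feature
        self.img_proj = nn.Linear(2048, embed_dim)   # map image feature to embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # The single global feature acts as the first "token" of the decoded sequence.
        feat = self.img_proj(self.encoder(images).flatten(1)).unsqueeze(1)
        seq = torch.cat([feat, self.embed(captions)], dim=1)
        hidden, _ = self.lstm(seq)
        return self.out(hidden)  # next-word logits at every step
```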
2. Related Work
b. Spatial Attention & Dynamic Focus:
Show, Attend and Tell addressed the limitation of global representations by introducing an attention
mechanism:
Spatial Attention: extracts spatial feature maps from an intermediate CNN layer instead of a single global vector.
Dynamic Focus: dynamically generates a context vector for each word by taking a weighted sum of the
CNN feature map, based on the computed attention weights (a sketch follows the reference below).
Key Advantages:
It improves captioning performance by allowing the model to capture local details.
It provides interpretable attention maps that can sometimes highlight the regions corresponding to
particular words in the caption.
Ref: Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., & Bengio, Y. (2015).
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.
In Proceedings of the International Conference on Machine Learning (ICML), pp. 2048–2057.
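A minimal sketch of the soft spatial attention just described, assuming additive (Bahdanau-style) scoring as in the paper; all dimensions are illustrative:

```python
import torch
import torch.nn as nn

# Sketch of soft spatial attention: a context vector per decoding step, computed
# as an attention-weighted sum over CNN feature-map regions.
class SoftAttention(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)       # project CNN regions
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)   # project decoder state
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, features, hidden):
        # features: (batch, num_regions, feat_dim) from an intermediate CNN layer
        # hidden:   (batch, hidden_dim) decoder state for the current word
        energy = torch.tanh(self.feat_proj(features) + self.hidden_proj(hidden).unsqueeze(1))
        alpha = torch.softmax(self.score(energy).squeeze(-1), dim=1)   # attention weights
        context = (alpha.unsqueeze(-1) * features).sum(dim=1)          # weighted sum
        return context, alpha  # alpha doubles as the interpretable attention map
```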
3. Motivations & Solutions
a. Motivations:
This problem can be approached in multiple ways:
Multi-label classification: predict each attribute independently.
Image Captioning: generate a descriptive caption directly from the image.
CLIP: strong at vision-language understanding; learns image-text alignment by
training on vast amounts of images paired with textual descriptions.
ResNet50+LSTM or EfficientNet+GRU: combine a CNN for image feature extraction
with a recurrent network (LSTM/GRU) for sequential attribute prediction, potentially
capturing dependencies between attributes.
Model | Image Feature Extractor | Text Processing | Key Strengths
EfficientNet+GRU | EfficientNet | GRU | More lightweight, better efficiency
3. Motivations & Solutions
b. Our Solution
EfficientNet + GRU
4. Model Details
a. Summary & Explanation of Models (CLIP)
CLIP: Overview
4. Model Details
a. Summary & Explanation of Models (CLIP)
CLIP: Encoder (ViT) - Details
Insight:
Contrastive loss encourages a model to learn representations by drawing similar samples
(such as an image and its augmented version) closer together while pushing dissimilar
samples apart; in CLIP, the similar pairs are matched images and texts (see the sketch below).
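A minimal sketch of the symmetric contrastive (InfoNCE) objective behind this insight; the temperature value is illustrative:

```python
import torch
import torch.nn.functional as F

# Sketch of CLIP-style symmetric contrastive loss: matched image-text pairs sit on
# the diagonal of the similarity matrix and are pulled together; off-diagonal
# (mismatched) pairs are pushed apart.
def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature          # pairwise cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # diagonal = matched pairs
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```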
4. Model Details
a. Summary & Explanation of Models (Encoder-Decoder)
Encoder-Decoder: ResNet50+LSTM
Insights:
BCEWithLogitsLoss combines the sigmoid activation function with binary cross-entropy (BCE)
loss in a single step, improving numerical stability and training efficiency.
It avoids issues like vanishing or exploding gradients by handling extreme logit values
better than applying sigmoid and BCE separately (see the sketch below).
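A minimal sketch of a multi-label attribute head trained with BCEWithLogitsLoss; the layer sizes and attribute count are illustrative, not the project's actual configuration:

```python
import torch
import torch.nn as nn

num_features, num_attributes = 2048, 40          # e.g. ResNet50 feature size (illustrative)
head = nn.Linear(num_features, num_attributes)   # outputs raw logits: no sigmoid layer
criterion = nn.BCEWithLogitsLoss()               # fuses sigmoid + BCE for numerical stability

features = torch.randn(8, num_features)                       # pooled CNN features
targets = torch.randint(0, 2, (8, num_attributes)).float()    # multi-hot attribute labels

loss = criterion(head(features), targets)        # pass logits, not probabilities
loss.backward()
```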
4. Model Details
b. Proposed Model
5. Implementation Details
a. Training Pipeline (CLIP):
Parameter | Value | Explanation
Backbone Model | CLIP (ViT-B/32) | Pretrained vision-language model that extracts meaningful image-text features.
Feature Vector Size | 512 | Output dimension of CLIP's text and image embeddings.
Loss Function | Cross-entropy | Measures classification performance between predicted and true labels.
Similarity Metric | Cosine Similarity | Matches predicted embeddings with label embeddings.
Pretrained Weights | True | Uses OpenAI's pretrained weights for CLIP ViT-B/32.
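A minimal sketch of this pipeline at inference time, using OpenAI's clip package; the label strings and image path are hypothetical placeholders:

```python
import torch
import clip  # OpenAI's CLIP package (https://github.com/openai/CLIP)
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # pretrained, 512-d embeddings

labels = ["Casual Cotton Red T-shirts", "Solid Lavender Shirts"]  # hypothetical label set
text = clip.tokenize(labels).to(device)
image = preprocess(Image.open("shirt.jpg")).unsqueeze(0).to(device)  # placeholder path

with torch.no_grad():
    image_emb = model.encode_image(image)                        # (1, 512)
    text_emb = model.encode_text(text)                           # (num_labels, 512)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    sims = (image_emb @ text_emb.t()).squeeze(0)                 # cosine similarity per label

print(labels[sims.argmax().item()])  # best-matching label
```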
5. Implementation Details
b. Configs Explanation (ResNet50 + LSTM)

Parameter | Value | Explanation
Feature Vector Size | 2048 | Output dimension of ResNet50 before passing to the LSTM.
Batch-First | True | Ensures input tensors are batch-major for easier processing.
Loss Function | BCEWithLogitsLoss | Multi-label classification loss between predicted logits and true labels.
Similarity Metric | Cosine Similarity | Matches predicted embeddings with label embeddings.

c. Configs Explanation (EfficientNetB4 + GRU)

Parameter | Value | Explanation
Feature Vector Size | 1792 | Output dimension of EfficientNetB4 before passing to the GRU.
Text Processing | GRU (Gated Recurrent Unit) | Captures temporal dependencies in feature sequences.
Batch-First | True | Ensures input tensors are batch-major for easier processing.
Loss Function | BCEWithLogitsLoss | Multi-label classification loss between predicted logits and true labels.
Similarity Metric | Cosine Similarity | Matches predicted embeddings with label embeddings.
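A minimal sketch tying these configs together (EfficientNetB4's 1792-d features, a batch-first GRU, raw logits for BCEWithLogitsLoss). Repeating one global feature as a short sequence is just one simple way to feed the GRU; the project's actual sequencing may differ:

```python
import torch
import torch.nn as nn
from torchvision import models

class EffNetGRU(nn.Module):
    def __init__(self, num_attributes=40, hidden_dim=512, seq_len=8):
        super().__init__()
        backbone = models.efficientnet_b4(weights="IMAGENET1K_V1")
        self.features = backbone.features             # conv feature extractor
        self.pool = nn.AdaptiveAvgPool2d(1)           # -> (batch, 1792, 1, 1)
        self.seq_len = seq_len
        self.gru = nn.GRU(1792, hidden_dim, batch_first=True)  # batch-major input
        self.head = nn.Linear(hidden_dim, num_attributes)      # raw logits

    def forward(self, images):
        x = self.pool(self.features(images)).flatten(1)   # (batch, 1792)
        x = x.unsqueeze(1).repeat(1, self.seq_len, 1)     # feature sequence for the GRU
        out, _ = self.gru(x)
        return self.head(out[:, -1])  # logits, to be paired with BCEWithLogitsLoss
```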
Ground Truth: Casual Cotton Embroidery lettering Long-Sleeve normal normal-fit Red T-shirts
Ground Truth: UNK Cotton Pockets Solid Sleeveless normal normal-fit Lavender Shirts
Ground Truth: Casual Cotton Drop-shoulder lettering Short-Sleeve normal normal-fit Black T-shirts