LayerFusion - Merging Text and AI for Layered Image Generation
Team 9
Abstract
Layered image creation is fundamental to professional digital media production, yet current AI image generation
tools produce only flat, single-layer outputs that cannot be easily integrated into professional workflows. We present
LayerFusion, an ensemble pipeline that combines multiple state-of-the-art image segmentation methods to
automatically convert AI-generated images into layered compositions. Our approach leverages five leading
segmentation techniques (CLIPSeg, ODISE, YOLO, HIPIE, and Detectron2), orchestrated through a
modular pipeline that handles both RGB and depth-based segmentation. The pipeline accepts any AI-generated
image along with its creation prompt and outputs separated layers containing discrete objects mentioned in the
prompt. We evaluate our methodology using both objective metrics (Structural Similarity Index, Edge Match Ratio)
and practical benchmarks (task completion). The system provides artists and content creators with a crucial bridge
between AI generation capabilities and professional digital media workflows.
Keywords: Layered image generation, Image segmentation, Ensemble methods, AI-assisted digital art, Layer
separation, Generative AI, Feature extraction, Digital content creation, Computer vision
Section 1: Introduction
Layered image creation is a critical requirement for professional digital media creation, and current generative AI
tools are generally unable to produce layered content because they are trained on large databases of non-layered
images. Without a general method for separating AI images into layers, digital media creation tasks do not benefit
significantly from AI tools, and research to develop image-generating AI has limited value to media creation jobs.
Digital artists build images layer by layer to simplify the management of complex designs by allowing them to
control and manipulate individual elements independently (Fig 1). Each layer can represent a different component,
such as the background, foreground, specific objects, or even text, allowing artists to make selective adjustments
without affecting the rest of the composition. Layering is especially important in professional environments. For
example, Adobe Photoshop is widely used by more than 161,000 companies (6sense, n.d.), and Adobe describes
layers as “one of Photoshop’s most powerful features” (Adobe Inc., n.d.). Layered image production began as a
photographic technique in the 1850s (Kerr, 2022) but found value in the late 1890s as a technique to make animation
possible; each drawing in a sequence was transferred onto transparent acetate for inking, coloring, and compositing
onto a background (Wikipedia. Traditional Animation, n.d.). Later inventions extended the technique by placing
layers of illustration at various depths and filming them in motion, creating forms of animation that would have been too expensive to draw by hand at 24 frames per second (Wikipedia, Multiplane Camera, n.d.). These techniques are
preserved and improved upon in modern digital workflows, while new benefits to image creation have emerged that
rely upon layers, including: dividing a workload among artists for faster completion time and then perfectly
recombining the pieces; allowing artists to specialize in certain techniques (like figure drawing or landscape
drawing); enabling rapid, flexible content changes to respond to market demands; applying motion simulation and
special effects techniques onto artwork (Wang & Adelson, 1994).
Figure 1. Side-by-side illustration demonstrating an image drawn in layers. Notice that the car could be redrawn or
moved without affecting artwork for the mountains, trees, road, or bushes.
The absence of layering in AI-generated images presents a limitation, as these flat, unlayered outputs cannot be
easily integrated into the workflows professionals depend on. Figure 1 demonstrates some of the challenges. The left
image was produced by AI in under 10 seconds with a simple prompt; it could serve as the starting point for an
educational explainer video or a piece of marketing content. It might take 20 minutes to draw this picture by hand,
but the AI drawing allows us to experiment with a few different versions and then move on to the rest of the content
creation quickly. But this is a flat image, so we cannot animate the pieces without first making careful cutouts of each part that is supposed to move. Likewise, if the image is used as promotional material and we wish to change the sky color slightly, we could apply a color adjustment to the photo, but again we would need a careful cutout of just the sky area so that we do not tint other parts accidentally. In the image on the right, each piece has been cut out to a separate layer, making it easy to move the parts separately, color them independently, or even replace the car with a bicycle while leaving the rest of the picture alone. This independent editing
capability is now at the core of most digital creation tools—even PowerPoint is built upon layers. Imagine needing
to completely re-make a whole PowerPoint slide every time a change in the wording or layout is needed; this is what
modern digital film animation and marketing image creation would be like without layers. This is why new, time-
saving AI image-generating techniques are not yet useful for most forms of professional media creation. The global
digital painting market was valued at USD 5.85 billion in 2024 and is projected to reach USD 18.48 billion by 2031
(Verified Market Research, 2024, July), with Adobe Inc’s stock nearly doubling over the past 5 years (Yahoo
Finance, 2024, Oct). This growth indicates a robust market that values tools for digital artwork creation. Digital art is a growing field with applications well beyond fine art: many pieces within the art industry itself are now created electronically rather than through traditional mediums, and many practical industries also rely on digitally created imagery. Graphic design and digital artwork are especially useful for marketing, education, and online environments. For example, graphic designers might create many advertisements with only minor differences for a new product and test them with a market research group. Creating these advertisements one by one would take a long time, but AI assistance combined with the layer technique saves time and allows for easy re-use of work. By developing tools that use SOTA image segmentation methods
to convert AI-generated images to layered images, we can bring the speed and flexibility of these new tools to
entertainment and marketing industries. To this end, we propose an image pipeline that accepts the results from any
artist-preferred image model and produces separate layers automatically for further refinement and professional
work, saving professionals many hours of work.
We identified the following five SOTA methods that we will use and evaluate for the sake of our project:
Yolo - You Only Look Once [Nate]
Method: Utilize YOLO to loop through a folder and detect an object in an image. From that point splice
out the detected object to be used in individual layers. Later, those layers can be repurposed in another
context by being overlaid on top of each other.
Implementation and Current Status: YOLO currently runs, detects objects in a photo, and saves each detected object into a folder named for that item; for example, a detected cat in an image is saved to the “cat” folder. The purpose of this is ultimately to give us a pool of specific images to work with for our project.
Pros: YOLO is extremely fast and detects objects in a folder of 2000 images in 21 seconds. Granted, these
were lower resolution photos. Completion times of higher resolution photos may be longer.
Cons: YOLO isn’t 100% accurate and can produce false positives. While an object is correctly detected perhaps 75%-80% of the time overall, YOLO can falsely detect an object that isn’t present or mistake one object for another. For example, a coffee mug can be mistaken for a cell phone if it’s far enough away in the frame. It’s also possible for an object to be detected multiple times and be overlaid over itself. While YOLO is accurate enough to use in most contexts, its strong suit is speed above all else.
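As a rough illustration of this detect-and-crop workflow, the following sketch uses the ultralytics Python package; the yolov8n.pt checkpoint, folder names, and PNG-only globbing are assumptions for illustration rather than the exact configuration of our pipeline.

# Sketch: detect objects in a folder of images with a pre-trained YOLO model
# and crop each detection into a per-class folder.
from pathlib import Path
from PIL import Image
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                      # any pre-trained detection checkpoint
in_dir, out_dir = Path("images"), Path("crops")

for img_path in in_dir.glob("*.png"):
    result = model(str(img_path))[0]            # one Results object per image
    image = Image.open(img_path)
    for i, box in enumerate(result.boxes):
        cls_name = result.names[int(box.cls)]   # e.g. "cat"
        x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
        crop = image.crop((x1, y1, x2, y2))     # rough rectangular cutout
        target = out_dir / cls_name
        target.mkdir(parents=True, exist_ok=True)
        crop.save(target / f"{img_path.stem}_{i}.png")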
HIPIE - Hierarchical Open-vocabulary Universal Image Segmentation
Implementation and Current Status: A working environment has been created for our pretrained HIPIE model. Currently we are resolving compilation errors and integrating it into the overall project. By inputting text prompts, the model will be able to isolate and identify objects within diverse scenes, providing adaptability and accuracy. Initial tests on various images have shown that HIPIE can efficiently detect and segment objects without the need for any retraining, which will make it very suitable for real-world applications.
Pros:
Open-vocabulary Flexibility: HIPIE can understand and segment objects from text descriptions, which fits directly into our data pipeline. This includes objects that were not part of its initial training data, making the model well suited to unseen classes and essential for real-world application.
Hierarchical understanding: The hierarchical approach ensures the model can differentiate between general
and specific elements.
Ease of integration: Because it works with pretrained models, it will require less computational overhead
for initial setups so we can test varied contexts without extensive cost of retraining.
Adaptability to complex scenes: HIPIE excels at segmenting objects that are partially covered or sit in cluttered environments or backgrounds, making it reliable in complex scenarios.
Cons:
Computationally intensive: HIPIE produces high-quality segmentation masks, but doing so is resource intensive, which leads to slower processing speeds in real-time applications.
Dependency on text prompts: The success of the segmentation relies heavily on the relevance of the text prompt provided. Inaccurate or sub-optimal text prompts may lead to poor segmentation results.
Limited performance on ambiguous inputs: Like other segmentation models, HIPIE may struggle with ambiguous or overlapping inputs where context is key; additional refinement would be necessary in such cases.
Scalability concerns: Because HIPIE deals with a broad array of classes and objects, there could be scalability challenges, especially when processing large batches of diverse images.
In this proposal, we will describe the solution and process that we intend to implement. We will use a flexible
ensemble pipeline with a testing space for five SOTA methods to perform segmentation. We describe the pipeline,
the function of each method and models used. Our pipeline will produce a large corpus of test images, and we will
describe our methods for evaluating the results and determining the best overall segmentation technique. We will
wrap up the project description with a discussion of lessons learned to this point in the project. We conclude with an estimated timeline for completing the remaining pieces of the pipeline and methods.
CLIPSeg - Image Segmentation Using Text and Image Prompts
Method
Both the input image and the text prompt are first encoded separately using the pre-trained CLIP model.
This produces a high-level feature embedding for the image and a corresponding embedding for the text
prompt. The goal is to align regions within the image to the textual description. CLIPSeg interpolates
between the original CLIP embeddings and segmentation-specific embeddings learned during fine-tuning
to predict the probability per-pixel of an object being present. CLIPSeg’s architecture makes it particularly
strong at segmenting novel presentations of objects—like a car on a ceiling—by adding a prediction
network to replace CLIP’s small-image latent encodings that favor familiar arrangements of items.
Retaining CLIP’s object understanding but producing a reduced-layer pixel-prediction model produces a
balance that is well-suited for open-ended prompts we might see in media editing, where predefined object
categories are insufficient, and the images are expected to be unusual compared to real photographs.
CLIPSeg rates, per pixel, the probability that “this pixel is part of a car”; probabilities from 0 to 1 are mapped onto a black-to-white mask.
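As a concrete illustration of this per-pixel prediction, the following sketch scores an image against a set of noun prompts and writes one grayscale mask per noun; the Hugging Face transformers API and the public CIDAS/clipseg-rd64-refined checkpoint are assumptions about one convenient way to run CLIPSeg, not necessarily our final integration.

# Sketch: per-pixel probability maps from CLIPSeg for a list of noun prompts.
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("generated.png").convert("RGB")
nouns = ["car", "mountain", "tree"]

inputs = processor(text=nouns, images=[image] * len(nouns),
                   padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits              # shape: (num_prompts, 352, 352)

probs = torch.sigmoid(logits)                    # logits mapped to 0-1 probabilities
for noun, prob in zip(nouns, probs):
    mask = Image.fromarray((prob.numpy() * 255).astype("uint8"))   # black-to-white
    mask.resize(image.size).save(f"mask_{noun}.png")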
ODISE - Open-vocabulary DIffusion-based panoptic SEgmentation (Object
detection)
Method
ODISE exploits pre-trained text-image diffusion and discriminative models to perform open-vocabulary panoptic segmentation. The model and code are open source and available at https://github.com/NVlabs/ODISE.
Bytez. (2023). ODISE detects and labels diverse objects in images based on text descriptions, including previously unseen categories. It combines a diffusion model with a discriminative model, such as CLIP, to associate image features with text labels. This hybrid structure enables ODISE to identify a broad range of objects without extensive retraining. https://bytez.com/docs/cvpr/23186/paper
NVlabs/ODISE. (2023). ODISE leverages the fixed representations of both models to perform panoptic segmentation across any category in open environments. https://github.com/NVlabs/ODISE
Pros
Open-Vocabulary Recognition: ODISE identifies objects not explicitly defined in the training data. It
is applicable to real-world environments and for applications requiring flexibility of object recognition.
Implicit captioner: Avoiding reliance on pre-generated captions or pre-labeled datasets.
Fast Performance: Outperforms previous state-of-the-art methods on multiple tasks, including COCO
and ADE20K datasets.
Cons
Image-Text Pairs: Without extensive image-text pair data, the model may not perform as well.
Performance: ODISE requires a significant amount of computation, which could make it prohibitively slow for small organizations.
HIPIE performs hierarchical, open-vocabulary segmentation guided by flexible natural language inputs. Therefore, HIPIE can segment objects that may not have been explicitly labeled or defined in its training set, making it applicable to scenarios where the objects are diverse, new, or not commonly encountered. The workflow begins by analyzing the input image and corresponding text prompts, generating pixel-level segmentation masks that highlight objects based on the described characteristics. The hierarchical approach lets it drill down to specific instances within the same context, for example distinguishing a chair from a table within the broader furniture category.
The process of segmentation in HIPIE involves a few steps. First the system uses vision-language models
to interpret the input text and establish context for what the segmentation task should focus on. That means
that the users can prompt the model with specific descriptions like “segment all electronics in the room”
and HIPIE will identify and create segmentation masks for all the relevant objects, even if it was not explicitly exposed to those exact types during training. This is a significant advantage because it removes the need for extensive retraining whenever a new object is introduced. Once the segmentation masks are generated, the hierarchical approach ensures that objects are identified and categorized based on their visual context and relationships. For example, it can distinguish between a cup and a bottle, even though both belong to the container category, by capturing the subtle differences between the objects through hierarchical processing.
Section 3: Proposed High-level Solution and
Process
Layer Fusion Pipeline for AI-Generated Images
The layer fusion pipeline is designed to accept an image file and a prompt, outputting a series of layers where each layer contains the portion of the image that approximately corresponds to a distinct object or noun present in the picture.
The intent is for this design to be flexible and modular because the image generation AI industry is rapidly evolving
with new models and particularly with fine-tuned LoRAs (Low-Rank Adaptations) published regularly, often even
fine-tuned by end users. Rather than training a custom image segmentation engine to work with a specific base
image model, we propose a workflow that can evaluate many segmentation models to determine which method
achieves the best balance of speed, accuracy, and prompt adherence for AI-generated images. We leverage the fact
that generative AI images are always accompanied by a descriptive prompt, which can be used to guide the
segmentation model.
Workflow Overview
In a finished tool version of this pipeline, we would accept a full text prompt and use natural language processing
tools to extract a list of tangible nouns for segmentation. These nouns, along with the AI-generated image, would be
fed into a custom program that selects an effective image tool, identifies areas of the image where each discrete
object has been drawn, separates those pieces through a binary image mask, and cuts out those portions of the
image. In this way, each separate object in the image is converted into a transparent layer containing only the color
pixel data for one object. These separate image files are then easily loaded into any multimedia editor as a series of
layers. For example, Adobe Photoshop can be extended with JavaScript (ExtendScript), AppleScript, or a plug-in infrastructure
that would support exporting an image into this pipeline and then loading each of the separate images as individual
layers.
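The noun-extraction step mentioned above could be as simple as part-of-speech tagging over the prompt. The following is a minimal sketch using spaCy, where the en_core_web_sm model and the noun-chunk filter are illustrative assumptions rather than a tool we have committed to:

# Sketch: extract candidate tangible nouns from a generation prompt with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_nouns(prompt: str) -> list[str]:
    doc = nlp(prompt)
    # Keep the heads of noun phrases tagged as common nouns; drop duplicates, keep order.
    nouns = [chunk.root.lemma_ for chunk in doc.noun_chunks
             if chunk.root.pos_ == "NOUN"]
    return list(dict.fromkeys(nouns))

print(extract_nouns("A cozy bed nestled in the corner of a grand, towering "
                    "building, with a soft, brown teddy bear resting on top."))
# e.g. ['bed', 'corner', 'building', 'bear']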
For our evaluation pipeline, we are focusing on producing rating scores for how five major current state-of-the-art
image segmentation techniques work, and so some portions of this pipeline have been modified to be better suited
for that task.
Preparation Steps
Our project pipeline begins with some preparation steps. We have developed a small piece of code to leverage
OpenAI's ChatGPT API to produce a series of image prompts and lists of tangible object nouns present in those
prompts. Since we are not evaluating natural language processing methods to extract nouns from descriptive prompt
text, we provide a consistent directory of prompts, including the nouns that are expected to be visually located in the
image as well as the full language that should be used to create the image. For our testing, we will use this list of
prompts and the expected nouns as a small database of images and nouns to evaluate the effectiveness of the
segmentation methods.
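A minimal sketch of this prompt-generation helper is shown below, assuming the openai Python client and a gpt-4o-mini model; the exact model, instructions, and output handling in our code may differ.

# Sketch: ask the ChatGPT API for image prompts plus the tangible nouns each contains.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

instructions = (
    "Write 5 short image-generation prompts. For each one, also list the "
    "tangible object nouns that should be visible in the image, as a "
    "comma-separated list, one record per line."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": instructions}],
)
print(response.choices[0].message.content)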
We are generating images this way to test a unique concept for which we have not found any readily available
dataset: images and the prompts that were used to generate them, sampled across a broad set of common nouns. This
is a departure from the methods generally used to develop and evaluate image segmentation algorithms, where real
images are passed through human evaluation to mark features, or where the features are unlabeled, and the training
produces a model that can label data. Since we are interested in developing a pipeline that will work for AI content
creators, all of the text labels are already known since they were used to generate the image. This special feature of
AI-generated images will be used in combination with each of these image segmentation methods to see if we can
refine and guide their output to produce segmentation that is useful for our novel task—the separation of images into
layers containing discrete objects.
We have a simple Python module that loads our list of prompts for testing and separates each entry into the list of tangible nouns expected to appear in the image and the full prompt text used to generate it.
From here, we move to the final step of the preparation process, which is also the first step in an intended finished
tool. We take the full text of the prompt and use it to generate an image using a well-researched methodology called
Stable Diffusion XL. This method is relatively fast, bound mainly by available VRAM and CUDA cores, and will
generate images that match our prompts and should contain the nouns in our noun list. We can do this in batch to
create hundreds of images, which we will use as part of our testing and evaluation suite.
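A minimal sketch of this batch-generation step, assuming the diffusers library and the public stabilityai/stable-diffusion-xl-base-1.0 checkpoint (our exact generation settings may differ):

# Sketch: batch-generate test images from our prompt list with SDXL via diffusers.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

prompts = [
    "A cozy bed nestled in the corner of a grand, towering building, "
    "with a soft, brown teddy bear resting on top.",
]

for i, prompt in enumerate(prompts):
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save(f"generated_{i:04d}.png")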
Segmentation Methods
With a corpus of images and associated nouns, we loop through our testing pipeline, which starts by applying state-
of-the-art methods to produce a layered image as described above.
Each state-of-the-art method will accept an image and a list of objects to detect, producing a segmentation
representation of each object present in the image, with each method doing so according to its own way of working.
From these representations, we will produce black and white masks using a binary threshold. These masks will be
black except in places where an object was detected, where the mask will be white.
Each segmentation method will export a set of masks and return a list of their file names to the primary pipeline,
which will then use the masks to generate a series of layers by cutting out portions of each image. This is where the
tool would ideally finish its work in a professional workflow.
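A minimal sketch of this cut-out step follows, where the binary threshold value of 128 and the file names are illustrative assumptions, and the mask is assumed to have the same resolution as the image:

# Sketch: turn a grayscale probability mask into a binary mask and cut the
# corresponding transparent layer out of the original image.
import numpy as np
from PIL import Image

def cut_layer(image_path: str, mask_path: str, out_path: str, threshold: int = 128):
    image = Image.open(image_path).convert("RGBA")
    mask = np.array(Image.open(mask_path).convert("L"))

    binary = np.where(mask >= threshold, 255, 0).astype("uint8")  # black/white mask
    layer = np.array(image)
    layer[..., 3] = binary        # alpha = 255 inside the object, 0 elsewhere

    Image.fromarray(binary).save(mask_path.replace(".png", "_binary.png"))
    Image.fromarray(layer).save(out_path)  # transparent layer containing one object

cut_layer("generated_0000.png", "mask_car.png", "layer_car.png")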
Models:
For each SOTA method, we implement models that are recent, trained on the ImageNet dataset, and described by their authors as benchmark-leading in at least one of the three metrics we prioritize (speed, ability to detect objects successfully, and accuracy of the regions identified).
HIPIE: vit_h_cloud.pth, a pretrained weights file provided at https://github.com/berkeley-hipie/HIPIE
Pre-trained YOLO model.
Python Modules:
Evaluation Pipeline
To this process, we add the evaluation pipeline, which will help us evaluate which state-of-the-art methods are best
for speed, object detection, and accuracy on prompt-driven AI-generated images when given the text of the prompt
to search for. Our evaluation pipeline is a set of high-speed measurements chosen with reference to our initial problem
statement. Artists traditionally create layered documents, and AI creates flat single-layer documents; significant time
and labor would be required to accurately cut out all the pieces of an AI-generated image to prepare it for a layered
workflow. We are evaluating:
Whether any of the objects supposed to be in the image return a blank mask due to the object not being
detected.
Whether the result of creating the layers for the document produced an image that is visually similar to the
original flat image generated by the AI.
The Structural Similarity Index (SSIM) score between the original and the layered image.
The Structural Similarity Index (SSIM) is a metric used to measure the similarity between two images to assess the
overall quality. It is used in evaluating compression algorithms, transmission formats, and generally, methods where
a distortion-free image can be compared to an image that has undergone a distortion. This method is provided in the
scikit-image library, offering a valuable measurement of image quality. SSIM produces a score based on three
separate factors, providing more insight about the difference between images than simply producing an image
difference or mean squared error from a pixel-wise comparison. It considers luminance, contrast, and structure by
evaluating brightness and contrast differences between the images, assigning scores to patches of differences
throughout the image, and then finding the mean score of all the patches of difference.
SSIM(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}
Where:
μx, μy = mean pixel intensities of images x and y
σx², σy² = variances of images x and y
σxy = covariance of images x and y
C1 and C2 are “constants to stabilize the division with weak denominators,” defined as C1 = (k1·L)² and C2 = (k2·L)², where L is the dynamic range of the pixel values and, by default, k1 = 0.01 and k2 = 0.03.
For each image we evaluate, our pipeline will accept the original image and the reconstruction of the image by
assembling layers, and compute the SSIM to assess how closely the reconstructed image matches the original.
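A minimal sketch of this comparison using scikit-image; the file names are placeholders, and the images are assumed to be RGB or RGBA PNGs of identical size:

# Sketch: compute SSIM between the original flat image and the image rebuilt
# by stacking the extracted layers.
from skimage import io
from skimage.color import rgb2gray
from skimage.metrics import structural_similarity

original = rgb2gray(io.imread("generated_0000.png")[..., :3])
rebuilt = rgb2gray(io.imread("reconstructed_0000.png")[..., :3])

score = structural_similarity(original, rebuilt, data_range=1.0)
print(f"SSIM: {score:.4f}")   # 1.0 means the reconstruction is identical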
Edge Match Ratio
This method evaluates the accuracy of image segmentation masks by comparing their edges to actual edges in the
source image. It's tolerant of small misalignments (allowing 1-pixel deviation) and produces a ratio between 0 and 1
indicating how well the mask edges align with actual image features. The scoring is particularly useful for
evaluating the quality of automated image segmentation or masking tools, where perfect pixel-perfect alignment
might not be necessary but general accuracy is important.
The computation proceeds as follows: the source image undergoes Canny edge detection, followed by a Gaussian blur (with blur_factor = 0.75); for the mask image, we extract contours and draw them as single-pixel edges.
Canny edge detection is particularly well-suited for this mask evaluation task for several reasons:
1. Dual thresholds: Canny uses two thresholds (100 and 200 in our implementation) to identify strong and weak edges. This reduces noise while preserving important edge continuity, making the comparison with mask edges more reliable.
2. Pre-processing: Canny applies Gaussian smoothing before computing gradients, which suppresses spurious edges from noise in the source image.
3. Thin edges: Canny produces single-pixel-width edges, which work well with the single-pixel contour perimeters evaluated in our masks and make pixel-to-pixel comparison more meaningful, since both sets of edges are similarly thin.
This combination of features makes Canny more reliable than simpler approaches like Sobel or simple thresholding
would be. The clean, thin edges it produces are ideal for comparing against mask boundaries, while its noise
resistance helps ensure the comparison focuses on meaningful image features rather than artifacts.
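A minimal sketch of this edge-match computation with OpenCV; treating blur_factor as the Gaussian sigma and implementing the 1-pixel tolerance via dilation are assumptions about one reasonable realization, not a verbatim copy of our evaluation code:

# Sketch: edge match ratio between a binary mask and the source image.
import cv2
import numpy as np

def edge_match_ratio(image_path: str, mask_path: str, blur_factor: float = 0.75) -> float:
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    mask = cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE)

    # Edges of the source image: Canny with dual thresholds, then a light blur.
    image_edges = cv2.Canny(gray, 100, 200)
    image_edges = cv2.GaussianBlur(image_edges, (3, 3), blur_factor)

    # Edges of the mask: single-pixel contour perimeters.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    mask_edges = np.zeros_like(mask)
    cv2.drawContours(mask_edges, contours, -1, 255, thickness=1)

    # Tolerate a 1-pixel deviation by dilating the image edges before comparing.
    tolerant = cv2.dilate(image_edges, np.ones((3, 3), np.uint8))
    mask_edge_pixels = mask_edges > 0
    if mask_edge_pixels.sum() == 0:
        return 0.0                       # blank mask: nothing to match
    matched = (tolerant[mask_edge_pixels] > 0).sum()
    return float(matched) / float(mask_edge_pixels.sum())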
We have created a set of standardized prompts in a dataset that we will use to generate images for testing purposes.
Some example prompts are as follows:
prompt1:: tall building, large window, city, giraffe's head, human arm, sparkling necklace, shadowy figure, dark
hair::A tall building with a large window overlooking a city, a giraffe's head peeking out from behind the window, a
human arm reaching out to touch the giraffe, a sparkling necklace hanging from the giraffe's neck, and a shadowy
figure with dark hair watching from the background.
prompt2:: bed, building, teddy bear::A cozy bed nestled in the corner of a grand, towering building, with a soft,
brown teddy bear resting on top.
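A minimal sketch of how such records can be split into a name, a noun list, and the full prompt; the "::" delimiter follows the examples above, while the one-record-per-line file layout is an assumption:

# Sketch: parse "name:: noun1, noun2, ...::full prompt text" records.
def parse_prompt_record(line: str) -> tuple[str, list[str], str]:
    name, nouns, prompt = (part.strip() for part in line.split("::", 2))
    return name, [n.strip() for n in nouns.split(",")], prompt

with open("prompts.txt", encoding="utf-8") as f:
    records = [parse_prompt_record(line) for line in f if line.strip()]

name, nouns, prompt = records[0]
# name   -> "prompt1"
# nouns  -> ["tall building", "large window", "city", ...]
# prompt -> "A tall building with a large window overlooking a city, ..."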
The full evaluation was run for 100 prompts on 3 batches of images (300 samples to test).
At this point, we will have a database in which every original image has been processed by each of our state-of-the-art methods, and each method is scored on time, prompt adherence, structural similarity, and edge accuracy. From this, we
hope to identify the benefits and disadvantages of each method, recommend future work to improve these methods,
and determine whether one method or a combination of methods selected based on the prompt has performed best.
By evaluating these methods, we aim to identify which segmentation techniques offer the best performance for our
specific application of layering AI-generated images based on text prompts. We also consider the missing item ratio,
but this factor can be mitigated in the future by using our findings in this project to train a more comprehensive
model on a greater number of objects and masks.
Metric | CLIPSeg | Detectron2 | HIPIE | YOLO | ODISE
Time (seconds) | Mean: 1.9188, Std: 0.0560 | Mean: 20.3269, Std: 0.5493 | N/A | 492 detections in 12.04 seconds; 527 detections in 12.53 seconds; 435 detections in 12.44 seconds | Mean: 241.74, Std: 0.27
SSIM | Mean: 0.7068, Std: 0.1341 | Mean: 0.3779, Std: 0.1427 | N/A | |
Average Edge Match Ratio | Mean: 0.4660, Std: 0.1326 | Mean: 0.1096, Std: 0.1072 | N/A | |
Missing Item Ratio | Mean: 0.1739, Std: 0.1454 | Mean: 0.8382, Std: 0.1503 | N/A | |
We encountered technical issues with the HIPIE implementation, where the method required custom training using
an outdated version of Nvidia’s Cuda Toolkit. The implementation from the published paper included a clone of
Detectron, used for initial training of the backbone piece of the method, and then relied upon this custom training
method to achieve the hierarchical “things versus stuff” separation that HIPIE is supposed to be able to perform. Our
initial investigation into this method seemed to show that sample model weights had been supplied, and it was not
until many weeks of work that we realized these weights weren’t working properly because of the requirement for
the intermediate training step. The hardware recommendation for this step greatly exceeded what we had available,
and so we were left with a classifier that is essentially no different from Detectron2, pre-trained on a 90-class ResNet.
Head-To-Head Assessment
Although all of the methods we investigated have benchmarks comparing their performance on Imagenet, we found
that many of the pretrained weights available were for much smaller data sets or had other limitations like the
training required for HIPIE mentioned above.
While our assessment bears in mind that we are not looking for the highest absolute score but rather the relative scores of these methods, we realized partway through the project that we also needed to deal with the fact that some methods miss significantly more objects than others, to an extent that would bias their performance. To that end, we updated our reporting to include per-layer edge match ratio scores, and we wrote a custom assessment program that runs off the evaluation reports head-to-head, comparing our accuracy metric only for objects that were successfully detected by both methods being compared. This gives us a better perspective on how well these methods work relative to one another, irrespective of whether they have been trained to recognize a specific object.
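A minimal sketch of that head-to-head comparison logic; representing each evaluation report as a dictionary of per-object edge-match scores keyed by (image, noun) is an assumption about how the reports could be structured, not the exact format of our files:

# Sketch: compare two methods' per-layer edge match scores only on objects
# that both methods actually detected.
def head_to_head(report_a: dict, report_b: dict) -> dict:
    wins_a = wins_b = 0
    common_a, common_b = [], []
    for key, score_a in report_a.items():      # key = (image_id, noun)
        score_b = report_b.get(key)
        if score_b is None:                    # detected by only one method
            continue
        common_a.append(score_a)
        common_b.append(score_b)
        if score_a > score_b:
            wins_a += 1
        elif score_b > score_a:
            wins_b += 1
    n = max(len(common_a), 1)
    return {"wins_a": wins_a, "wins_b": wins_b,
            "mean_a": sum(common_a) / n, "mean_b": sum(common_b) / n}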
Metric | CLIPSeg | Detectron
Total wins per method | 897 wins | 3 wins
Processing time (seconds, lower is better) | 1.9188 ± 0.0560 | 20.3269 ± 0.5502
SSIM score (closer to 1 is better) | 0.7068 ± 0.1343 | 0.3779 ± 0.1429
Avg edge match (closer to 1 is better) | 0.4660 ± 0.1328 | 0.1096 ± 0.1074
Wins in head-to-head comparisons | 132 wins | 68 wins
Average scores for items in common | 0.4385 ± 0.2383 | 0.3035 ± 0.2753
We were not able to get every kind of metric and every kind of head-to-head score for all five of our methods against one another because of some unexpected problems with computing resources that cropped up over the course of the semester. The ODISE method relies on several pieces of Detectron2, and specifically on a modified version of Detectron2 that requires an older version of PyTorch (1.7). The distributable release of this version, with its setup and dependencies, is no longer offered by Facebook, and so significant effort went into getting it running locally, but the group member investigating this method could not run it on his MacBook with an Apple silicon processor. This did not entirely stop him; he got the whole pipeline running in a Colab notebook with the help of the WPI helpdesk, but it took about 20 minutes to start a session and get all of the dependencies running every time he needed to step away and return to the code, and this became a serious problem for getting the code to run reliably in our evaluations. Instead, this team member manually ran the evaluation functions for a large batch of images and produced the calculated metrics, but not the head-to-head run-off scoring. Similarly, the group member working on Poly-YOLO, a segmentation variant of YOLO, discovered that it depends on weights for YOLO version 3, while YOLOv5 weights are the only ones we could find commonly distributed for that method. The model weights commonly available for v5 are stored in Keras format, but we needed the Darknet format, which appears to be an older tensor-storage format. To get around this, the group member investigating YOLO built an updated segmentation method using version 5 with a custom cropping function added to the detection step, but the results were rough cutouts made of rectangles, the closest approximation he could get without a supercomputer cluster available to retrain Poly-YOLO.
Ultimately, we determined that ODISE was almost incomprehensibly slow and that YOLO was the fastest method but essentially incapable of segmentation, with no significant way to improve it, because the YOLOv5 method does not represent its understanding of an object in a picture as a per-pixel prediction, requiring us to use a kind of Riemann-sum box method. Focusing on the two methods that fully worked in the way we had hoped for this project, detecting objects in pictures and producing cutouts of them, the CLIPSeg method was easily our pick, with the recommendations we will cover in the Future Work section about how we would like to see it improved in a future version of this task.
Perhaps the most significant lesson learned in evaluating these methods is the picture they paint about computational
resources. From our literature review, we felt that we would probably be trading between speed and accuracy in
these methods, and that has certainly held true. We see, for example, that CLIPSeg is consistently good and fast, but with an edge accuracy of about 40% its masks are not particularly usable for layered image editing. Detectron showed tremendous gains when we factored out objects it could not find, but its mask accuracy still fell short of CLIPSeg’s. This suggests that if we could train Detectron on a much larger dataset, its accuracy could become the best of our methods; however, we also saw that Detectron was slow, averaging 20 seconds to find all of the objects in an image and cut them out.
A final lesson learned, which might be a bit humorous but was certainly a major headache for our entire team, was to beware of Detectron. Each of the methods that relied on it cost significant time and difficulty for the group member involved. Three of our group members needed some form of Detectron, or the various tools it provides, as part of their method, and the most successful of the three got its dependencies to install only after 6 hours of continuous work. The other two group members worked nightly for over two weeks, chipping away at bizarre errors and dependency conflicts, before they got something that worked reliably. Trying to get all of these to work within one environment so that we could run the evaluation pipeline turned out to be technically impossible. After long experience, it seems that many versions of Detectron are built with extremely narrow requirements, often down to a single version of PyTorch or a single type of processing architecture, and many variants are no longer distributed. Some of the ones we encountered required us to downgrade NumPy to a significantly earlier version incompatible with most of the layer fusion functionality. Detectron is a very powerful platform, as is evident from how many unique and interesting methods hook its various tools as essential parts of their experiments, but it quickly became apparent that it is best suited for single-purpose applications, where one specific install is designed for one specific platform with one specific task to perform. Attempting to deploy it flexibly, to locate objects in photos, to perform segmentation, and as a step in various other segmentation pipelines with their own custom networks, proved to be too many different things to ask of it at once. We did not know this would be the case back in the second week of the term when we were choosing our methods, so this was a significant lesson learned about requirements and things to look out for in future projects that combine multiple state-of-the-art methods.
Future Work
Based on our experiments with CLIPSeg, we propose several promising directions for future research to enhance the
model's segmentation capabilities. First, we suggest making fundamental improvements to the decoder architecture.
By incorporating additional transformer layers and multi-scale feature processing, the model could better capture
both fine-grained details and broader contextual information. The current simple linear projection must be replaced
with a more sophisticated upsampling architecture to generate more precise segmentation boundaries. We make this
suggestion on the basis of an example provided by the authors of CLIPSeg after publication, in which they
significantly improved per-pixel accuracy by exploring different connections in this part of the architecture.
Training methodology also presents significant opportunities for advancement. We propose exploring a broader
range of data augmentation techniques to improve the model's robustness and generalization capabilities.
Additionally, expanding the training dataset to include more diverse scenarios and object types could enhance the
model's real-world applicability. The introduction of specialized loss functions that specifically target boundary
accuracy could lead to more precise segmentation results, particularly in challenging cases with complex object
boundaries. We base this idea on the relatively limited PhraseCut dataset; while it contains 360,000 language/mask
pairs, the masks in the dataset are not always very precise. By improving the dataset (possibly recruiting professionals to assist with ground-truth masks), we feel the training could improve.
Finally, we propose several architectural enhancements that could significantly improve performance. Exploring
additional skip connections from earlier CLIP layers could help preserve fine-grained spatial information throughout
the network. The addition of multi-resolution processing paths or more pooling layers could enable the model to
better handle objects at varying scales. Working with the relatively older CLIP required us to scale images to
352x352 for our initial layers. Though the CLIPSeg method works at larger resolutions, we suspect that nuance is lost with this small input-size requirement.
These proposed adjustments could collectively enhance CLIPSeg's ability to generate more precise and reliable
segmentation masks while maintaining its flexibility in handling both text and image prompts. Future research could
easily be plugged into our evaluation pipeline and toolset to track improvements.
Background research:
SDXL (Image generator):
Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., & Rombach, R. (2023). SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv. https://arxiv.org/abs/2307.01952
Lüddecke, T., & Ecker, A. S. (2022). Image segmentation using text and image prompts. arXiv. https://arxiv.org/abs/2112.10003
Xu, J., et al. (2023). Open-vocabulary panoptic segmentation with text-to-image diffusion models. arXiv. https://arxiv.org/abs/2303.04803
Poly-YOLO. (n.d.). Poly-YOLO: Higher speed, more precise detection, and instance segmentation for YOLOv3. SpringerLink. https://link.springer.com/article/10.1007/s00521-021-05978-9
Song, Q., Li, S., Bai, Q., Yang, J., Zhang, X., Li, Z., & Duan, Z. (2021). Object detection method for grasping robot based on improved YOLOv5. Micromachines, 12(11), 1273. https://doi.org/10.3390/mi12111273
Xu, J., Han, Z., Xu, H., Zhang, C., Huang, K., & Bai, X. (2023). Hierarchical open-vocabulary universal image segmentation. Papers with Code. https://paperswithcode.com/paper/hierarchical-open-vocabulary-universal-image-1
Abhishek, A. V. S. (2021). Detectron2 object detection & manipulating images using cartoonization. IJERT. https://www.ijert.org/research/detectron2-object-detection-manipulating-images-using-cartoonization-IJERTV10IS080122.pdf
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin Transformer: Hierarchical vision transformer using shifted windows. IEEE. https://ieeexplore.ieee.org/document/9710580
Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning transferable visual models from natural language supervision. arXiv. https://arxiv.org/abs/2103.00020
Ranftl, R., Lasinger, K., Hafner, D., et al. (2020). MiDaS: Mixed dataset training for monocular depth estimation. arXiv. https://arxiv.org/abs/1907.01341
Lugmayr, A., Danelljan, M., Romero, A., et al. (2022). Denoising diffusion probabilistic models for image inpainting. arXiv. https://arxiv.org/abs/2201.09865
Nguyen, T. T. H., Clement, T., Nguyen, P. T. L., Kemmerzell, N., Truong, V. B., Nguyen, V. T. K., Abdelaal, M., & Cao, H. (2024). LangXAI: Integrating large vision models for generating textual explanations to enhance explainability in visual perception tasks. GitHub. https://github.com/hungntt/LangXAI
References:
6sense. (n.d.). Companies using Adobe Photoshop. https://6sense.com/tech/graphic-design-software/adobe-photoshop-market-share
Adobe Inc. (n.d.). Layer basics. Adobe Photoshop user guide. https://helpx.adobe.com/photoshop/using/layer-basics.html
Boesch, G. (2024, February 11). Detectron2: A rundown of Meta's computer vision framework. viso.ai. https://viso.ai/deep-learning/detectron2
Kamali, N., Nakamura, K., Chatzimparmpas, A., Hullman, J., & Groh, M. (2024). How to distinguish AI-generated images from authentic photographs. arXiv. https://arxiv.org/abs/2406.08651
Kerr, M. P. M. (2022). Developing Fluid: Precision, Vagueness and Gustave Le Gray's Photographic Beachscapes. In Coastal Cultures of the Long Nineteenth Century (pp. 200–224). Edinburgh University Press. https://doi.org/10.1515/9781474435758-015
Verified Market Research. (2024, July). Digital painting market size and forecast. https://www.verifiedmarketresearch.com/product/digital-painting-market
Wang, J. Y. A., & Adelson, E. H. (1994). Representing moving images with layers. IEEE Transactions on Image Processing, 3(5). https://ieeexplore.ieee.org/document/334981
Wikipedia. (n.d.). Traditional animation. https://en.wikipedia.org/wiki/Traditional_animation