LayerFusion - Merging Text and AI for Layered Image Generation
Team 9
Abstract
Layered image creation is fundamental to professional digital media production, yet current AI image generation
tools produce only flat, single-layer outputs that cannot be easily integrated into professional workflows. We present
LayerFusion, an ensemble pipeline that combines multiple state-of-the-art image segmentation methods to
automatically convert AI-generated images into layered compositions. Our approach leverages five leading
segmentation techniques (CLIPSeg, ODISE, YOLO, HIPIE, and Detectron2), orchestrated through a
modular pipeline that handles both RGB and depth-based segmentation. The pipeline accepts any AI-generated
image along with its creation prompt and outputs separated layers containing discrete objects mentioned in the
prompt. We evaluate our methodology using both objective metrics (Structural Similarity Index, Edge Match Ratio)
and practical benchmarks (task completion). The system provides artists and content creators with a crucial bridge
between AI generation capabilities and professional digital media workflows.
Keywords: Layered image generation, Image segmentation, Ensemble methods, AI-assisted digital art, Layer
separation, Generative AI, Feature extraction, Digital content creation, Computer vision
Section 1: Introduction
Layered image creation is a critical requirement for professional digital media creation, and current generative AI
tools are generally unable to produce layered content because they are trained on large databases of non-layered
images. Without a general method for separating AI images into layers, digital media creation tasks do not benefit
significantly from AI tools, and research to develop image-generating AI has limited value to media creation jobs.
Digital artists build images layer by layer to simplify the management of complex designs by allowing them to
control and manipulate individual elements independently (Fig 1). Each layer can represent a different component,
such as the background, foreground, specific objects, or even text, allowing artists to make selective adjustments
without affecting the rest of the composition. Layering is especially important in professional environments. For
example, Adobe Photoshop is widely used by more than 161,000 companies (6sense, n.d.), and Adobe describes
layers as “one of Photoshop’s most powerful features” (Adobe Inc., n.d.). Layered image production began as a
photographic technique in the 1850s (Kerr, 2022) but found value in the late 1890s as a technique to make animation
possible; each drawing in a sequence was transferred onto transparent acetate for inking, coloring, and compositing
onto a background (Wikipedia. Traditional Animation, n.d.). Later inventions extended the technique by placing
layers of illustration at various depths and filming them in motion, creating forms of animation that would have been too expensive to draw by hand at 24 frames per second (Wikipedia, Multiplane Camera, n.d.). These techniques are
preserved and improved upon in modern digital workflows, while new benefits to image creation have emerged that
rely upon layers, including: dividing a workload among artists for faster completion time and then perfectly
recombining the pieces; allowing artists to specialize in certain techniques (like figure drawing or landscape
drawing); enabling rapid, flexible content changes to respond to market demands; applying motion simulation and
special effects techniques onto artwork (Wang & Adelson, 1994).
Figure 1. Side-by-side illustration demonstrating an image drawn in layers. Notice that the car could be redrawn or
moved without affecting artwork for the mountains, trees, road, or bushes.
The absence of layering in AI-generated images presents a limitation, as these flat, unlayered outputs cannot be
easily integrated into the workflows professionals depend on. Figure 1 demonstrates some of the challenges. The left
image was produced by AI in under 10 seconds with a simple prompt; it could serve as the starting point for an
educational explainer video or a piece of marketing content. It might take 20 minutes to draw this picture by hand,
but the AI drawing allows us to experiment with a few different versions and then move on to the rest of the content
creation quickly. But this is a flat image, so we cannot animate the pieces without first making careful cutouts of each part that is supposed to move. Likewise, if the image is used as promotional material and we wish to change the sky color slightly, we could apply a color adjustment to the photo, but again we would need a careful cutout of just the sky area so that we do not tint other parts accidentally. In the image on the right, each piece has been cut out to a separate layer, making it easy to move the parts separately, color them independently, or even replace the car with a bicycle while leaving the rest of the picture alone. This independent editing
capability is now at the core of most digital creation tools—even PowerPoint is built upon layers. Imagine needing
to completely re-make a whole PowerPoint slide every time a change in the wording or layout is needed; this is what
modern digital film animation and marketing image creation would be like without layers. This is why new, time-
saving AI image-generating techniques are not yet useful for most forms of professional media creation. The global
digital painting market was valued at USD 5.85 billion in 2024 and is projected to reach USD 18.48 billion by 2031
(Verified Market Research, 2024, July), with Adobe Inc’s stock nearly doubling over the past 5 years (Yahoo
Finance, 2024, Oct). This growth indicates a robust market that values tools for digital artwork creation. Digital art is a growing field with applications well beyond fine art: many pieces within the art industry itself are now created electronically rather than through traditional mediums, and many practical industries also rely on digitally created imagery. Graphic design and digital artwork are especially useful for marketing, education, and online environments. For example, graphic designers might create many advertisements with only minor differences for a new product and test them with a market research group. Creating these advertisements one by one would take a long time, but AI assistance combined with the layer technique saves time and allows for easy re-use of work. By developing tools that use SOTA image segmentation methods
to convert AI-generated images to layered images, we can bring the speed and flexibility of these new tools to
entertainment and marketing industries. To this end, we propose an image pipeline that accepts the results from any
artist-preferred image model and produces separate layers automatically for further refinement and professional
work, saving professionals many hours of work.
We identified the following five SOTA methods that we will use and evaluate for the sake of our project:
Yolo - You Only Look Once [Nate]
Method: Utilize YOLO to loop through a folder and detect an object in an image. From that point splice
out the detected object to be used in individual layers. Later, those layers can be repurposed in another
context by being overlaid on top of each other.
Implementation and Current Status: YOLO currently runs, detects objects in a photo, and saves each detected object into a folder named for that item; for example, a detected cat in an image is saved to the “cat” folder. The purpose of this is ultimately to give us a pool of specific images to work with for our project.
Pros: YOLO is extremely fast and detects objects in a folder of 2000 images in 21 seconds. Granted, these
were lower resolution photos. Completion times of higher resolution photos may be longer.
Cons: YOLO isn’t 100% accurate and can produce false positives. While an object is correctly detected perhaps 75%-80% of the time overall, YOLO can falsely detect an object that isn’t present or mistake one object for another. For example, a coffee mug can be mistaken for a cell phone if it’s far enough away in the frame. It’s also possible for an object to be detected multiple times and be overlaid over itself. While YOLO is accurate enough to use in most contexts, its strong suit is speed above all else.
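As a rough illustration of this detect-and-crop workflow, the following sketch uses the ultralytics Python package; the yolov8n.pt checkpoint, folder names, and PNG-only globbing are assumptions for illustration rather than the exact configuration of our pipeline.

# Sketch: detect objects in a folder of images with a pre-trained YOLO model
# and crop each detection into a per-class folder.
from pathlib import Path
from PIL import Image
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                      # any pre-trained detection checkpoint
in_dir, out_dir = Path("images"), Path("crops")

for img_path in in_dir.glob("*.png"):
    result = model(str(img_path))[0]            # one Results object per image
    image = Image.open(img_path)
    for i, box in enumerate(result.boxes):
        cls_name = result.names[int(box.cls)]   # e.g. "cat"
        x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
        crop = image.crop((x1, y1, x2, y2))     # rough rectangular cutout
        target = out_dir / cls_name
        target.mkdir(parents=True, exist_ok=True)
        crop.save(target / f"{img_path.stem}_{i}.png")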
HIPIE - Hierarchical Open-vocabulary Universal Image Segmentation
Implementation and Current Status: A working environment has been created for our pretrained HIPIE model. Currently we are resolving compilation errors and integrating it into the overall project. By inputting text prompts, the model will be able to isolate and identify objects within diverse scenes, providing adaptability and accuracy. Initial tests on various images have shown that HIPIE can efficiently detect and segment objects without the need for any retraining, which will make it very suitable for real-world applications.
Pros:
Open-vocabulary Flexibility: HIPIE can understand and segment objects from text descriptions, which fits directly into our data pipeline. This includes objects that were not part of its initial training data, making the model well suited to unseen classes and essential for real-world application.
Hierarchical understanding: The hierarchical approach ensures the model can differentiate between general
and specific elements.
Ease of integration: Because it works with pretrained models, it will require less computational overhead
for initial setups so we can test varied contexts without extensive cost of retraining.
Adaptability to complex scenes: HIPIE excels at segmenting objects that are partially covered or sit in cluttered environments or backgrounds, making it reliable in complex scenarios.
Cons:
Computationally intensive: HIPIE produces high-quality segmentation masks, but doing so is resource intensive, which leads to slower processing speeds in real-time applications.
Dependency on text prompts: The success of the segmentation relies heavily on the relevance of the text prompt provided. Inaccurate or sub-optimal text prompts may lead to poor segmentation results.
Limited performance on ambiguous inputs: Like other segmentation models, HIPIE may struggle with ambiguous or overlapping inputs where context is key; additional refinement would be necessary in such cases.
Scalability concerns: Because HIPIE deals with a broad array of classes and objects, there could be scalability challenges, especially when processing large batches of diverse images.
In this proposal, we will describe the solution and process that we intend to implement. We will use a flexible
ensemble pipeline with a testing space for five SOTA methods to perform segmentation. We describe the pipeline,
the function of each method and models used. Our pipeline will produce a large corpus of test images, and we will
describe our methods for evaluating the results and determining the best overall segmentation technique. We will
wrap up the project description with a discussion of lessons learned to this point in the project. We conclude with an estimated timeline for completing the remaining pieces of the pipeline and methods.
CLIPSeg - Image Segmentation Using Text and Image Prompts
Method
Both the input image and the text prompt are first encoded separately using the pre-trained CLIP model.
This produces a high-level feature embedding for the image and a corresponding embedding for the text
prompt. The goal is to align regions within the image to the textual description. CLIPSeg interpolates
between the original CLIP embeddings and segmentation-specific embeddings learned during fine-tuning
to predict the probability per-pixel of an object being present. CLIPSeg’s architecture makes it particularly
strong at segmenting novel presentations of objects—like a car on a ceiling—by adding a prediction
network to replace CLIP’s small-image latent encodings that favor familiar arrangements of items.
Retaining CLIP’s object understanding but producing a reduced-layer pixel-prediction model produces a
balance that is well-suited for open-ended prompts we might see in media editing, where predefined object
categories are insufficient, and the images are expected to be unusual compared to real photographs.
CLIPSeg rates, per pixel, the probability that “this pixel is part of a car”; probabilities from 0 to 1 are mapped onto a black-to-white mask.
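As a concrete illustration of this per-pixel prediction, the following sketch scores an image against a set of noun prompts and writes one grayscale mask per noun; the Hugging Face transformers API and the public CIDAS/clipseg-rd64-refined checkpoint are assumptions about one convenient way to run CLIPSeg, not necessarily our final integration.

# Sketch: per-pixel probability maps from CLIPSeg for a list of noun prompts.
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("generated.png").convert("RGB")
nouns = ["car", "mountain", "tree"]

inputs = processor(text=nouns, images=[image] * len(nouns),
                   padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits              # shape: (num_prompts, 352, 352)

probs = torch.sigmoid(logits)                    # logits mapped to 0-1 probabilities
for noun, prob in zip(nouns, probs):
    mask = Image.fromarray((prob.numpy() * 255).astype("uint8"))   # black-to-white
    mask.resize(image.size).save(f"mask_{noun}.png")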
ODISE - Open-vocabulary DIffusion-based panoptic SEgmentation (Object
detection)
Method
ODISE exploits pre-trained text-image diffusion and discriminative models to perform open-vocabulary panoptic segmentation. The model and code are open source and available at https://github.com/NVlabs/ODISE.
Bytez. (2023). ODISE detects and labels diverse objects in images based on text descriptions, including previously unseen categories. It combines a diffusion model with a discriminative model, such as CLIP, to associate image features with text labels. This hybrid structure enables ODISE to identify a broad range of objects without extensive retraining. https://bytez.com/docs/cvpr/23186/paper
NVlabs/ODISE. (2023). ODISE leverages the fixed representations of both models to perform panoptic segmentation across any category in open environments. https://github.com/NVlabs/ODISE
Pros
Open-Vocabulary Recognition: ODISE identifies objects not explicitly defined in the training data. It
is applicable to real-world environments and for applications requiring flexibility of object recognition.
Implicit captioner: Avoiding reliance on pre-generated captions or pre-labeled datasets.
Fast Performance: Outperforms previous state-of-the-art methods on multiple tasks, including COCO
and ADE20K datasets.
Cons
Image-Text Pairs: Without extensive image-text pair data, the model may not perform as well.
Performance: ODISE requires a significant amount of computation, which could make it prohibitively slow for small organizations.
HIPIE performs hierarchical, open-vocabulary segmentation guided by flexible natural language inputs. Therefore, HIPIE can segment objects that may not have been explicitly labeled or defined in its training set, making it applicable to scenarios where the objects are diverse, new, or not commonly encountered. The workflow begins by analyzing the input image and corresponding text prompts, generating pixel-level segmentation masks that highlight objects based on the described characteristics. The hierarchical approach lets it drill down to specific instances within the same context, for example distinguishing a chair from a table within the broader furniture category.
The process of segmentation in HIPIE involves a few steps. First the system uses vision-language models
to interpret the input text and establish context for what the segmentation task should focus on. That means
that the users can prompt the model with specific descriptions like “segment all electronics in the room”
and HIPIE will identify and create segmentation masks for all the relevant objects, even if it was not explicitly exposed to those exact types during training. This is a significant advantage because it removes the need for extensive retraining whenever a new object is introduced. Once the segmentation masks are generated, the hierarchical approach ensures that objects are identified and categorized based on their visual context and relationships. For example, it can distinguish between a cup and a bottle, even though both belong to the container category, by capturing the subtle differences between the objects through hierarchical processing.
Section 3: Proposed High-level Solution and
Process
Layer Fusion Pipeline for AI-Generated Images
The layer fusion pipeline is designed to accept an image file and a prompt, outputting a series of layers where each layer contains the portion of the image that approximately corresponds to a distinct object or noun present in the picture.
The intent is for this design to be flexible and modular because the image generation AI industry is rapidly evolving
with new models and particularly with fine-tuned LoRAs (Low-Rank Adaptations) published regularly, often even
fine-tuned by end users. Rather than training a custom image segmentation engine to work with a specific base
image model, we propose a workflow that can evaluate many segmentation models to determine which method
achieves the best balance of speed, accuracy, and prompt adherence for AI-generated images. We leverage the fact
that generative AI images are always accompanied by a descriptive prompt, which can be used to guide the
segmentation model.
Workflow Overview
In a finished tool version of this pipeline, we would accept a full text prompt and use natural language processing
tools to extract a list of tangible nouns for segmentation. These nouns, along with the AI-generated image, would be
fed into a custom program that selects an effective image tool, identifies areas of the image where each discrete
object has been drawn, separates those pieces through a binary image mask, and cuts out those portions of the
image. In this way, each separate object in the image is converted into a transparent layer containing only the color
pixel data for one object. These separate image files are then easily loaded into any multimedia editor as a series of
layers. For example, Adobe Photoshop can be extended with JavaScript (ExtendScript), AppleScript, or a plug-in infrastructure
that would support exporting an image into this pipeline and then loading each of the separate images as individual
layers.
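The noun-extraction step mentioned above could be as simple as part-of-speech tagging over the prompt. The following is a minimal sketch using spaCy, where the en_core_web_sm model and the noun-chunk filter are illustrative assumptions rather than a tool we have committed to:

# Sketch: extract candidate tangible nouns from a generation prompt with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_nouns(prompt: str) -> list[str]:
    doc = nlp(prompt)
    # Keep the heads of noun phrases tagged as common nouns; drop duplicates, keep order.
    nouns = [chunk.root.lemma_ for chunk in doc.noun_chunks
             if chunk.root.pos_ == "NOUN"]
    return list(dict.fromkeys(nouns))

print(extract_nouns("A cozy bed nestled in the corner of a grand, towering "
                    "building, with a soft, brown teddy bear resting on top."))
# e.g. ['bed', 'corner', 'building', 'bear']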
For our evaluation pipeline, we are focusing on producing rating scores for how five major current state-of-the-art
image segmentation techniques work, and so some portions of this pipeline have been modified to be better suited
for that task.
Preparation Steps
Our project pipeline begins with some preparation steps. We have developed a small piece of code to leverage
OpenAI's ChatGPT API to produce a series of image prompts and lists of tangible object nouns present in those
prompts. Since we are not evaluating natural language processing methods to extract nouns from descriptive prompt
text, we provide a consistent directory of prompts, including the nouns that are expected to be visually located in the
image as well as the full language that should be used to create the image. For our testing, we will use this list of
prompts and the expected nouns as a small database of images and nouns to evaluate the effectiveness of the
segmentation methods.
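A minimal sketch of this prompt-generation helper is shown below, assuming the openai Python client and a gpt-4o-mini model; the exact model, instructions, and output handling in our code may differ.

# Sketch: ask the ChatGPT API for image prompts plus the tangible nouns each contains.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

instructions = (
    "Write 5 short image-generation prompts. For each one, also list the "
    "tangible object nouns that should be visible in the image, as a "
    "comma-separated list, one record per line."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": instructions}],
)
print(response.choices[0].message.content)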
We are generating images this way to test a unique concept for which we have not found any readily available
dataset: images and the prompts that were used to generate them, sampled across a broad set of common nouns. This
is a departure from the methods generally used to develop and evaluate image segmentation algorithms, where real
images are passed through human evaluation to mark features, or where the features are unlabeled, and the training
produces a model that can label data. Since we are interested in developing a pipeline that will work for AI content
creators, all of the text labels are already known since they were used to generate the image. This special feature of
AI-generated images will be used in combination with each of these image segmentation methods to see if we can
refine and guide their output to produce segmentation that is useful for our novel task—the separation of images into
layers containing discrete objects.
We have a simple Python module that loads our list of prompts for testing and separates each entry into the list of tangible nouns expected to appear in the image and the full prompt text used to generate it.
From here, we move to the final step of the preparation process, which is also the first step in an intended finished
tool. We take the full text of the prompt and use it to generate an image using a well-researched methodology called
Stable Diffusion XL. This method is relatively fast, bound mainly by available VRAM and CUDA cores, and will
generate images that match our prompts and should contain the nouns in our noun list. We can do this in batch to
create hundreds of images, which we will use as part of our testing and evaluation suite.
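A minimal sketch of this batch-generation step, assuming the diffusers library and the public stabilityai/stable-diffusion-xl-base-1.0 checkpoint (our exact generation settings may differ):

# Sketch: batch-generate test images from our prompt list with SDXL via diffusers.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

prompts = [
    "A cozy bed nestled in the corner of a grand, towering building, "
    "with a soft, brown teddy bear resting on top.",
]

for i, prompt in enumerate(prompts):
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save(f"generated_{i:04d}.png")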
Segmentation Methods
With a corpus of images and associated nouns, we loop through our testing pipeline, which starts by applying state-
of-the-art methods to produce a layered image as described above.
Each state-of-the-art method will accept an image and a list of objects to detect, producing a segmentation
representation of each object present in the image, with each method doing so according to its own way of working.
From these representations, we will produce black and white masks using a binary threshold. These masks will be
black except in places where an object was detected, where the mask will be white.
Each segmentation method will export a set of masks and return a list of their file names to the primary pipeline,
which will then use the masks to generate a series of layers by cutting out portions of each image. This is where the
tool would ideally finish its work in a professional workflow.
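A minimal sketch of this cut-out step follows, where the binary threshold value of 128 and the file names are illustrative assumptions, and the mask is assumed to have the same resolution as the image:

# Sketch: turn a grayscale probability mask into a binary mask and cut the
# corresponding transparent layer out of the original image.
import numpy as np
from PIL import Image

def cut_layer(image_path: str, mask_path: str, out_path: str, threshold: int = 128):
    image = Image.open(image_path).convert("RGBA")
    mask = np.array(Image.open(mask_path).convert("L"))

    binary = np.where(mask >= threshold, 255, 0).astype("uint8")  # black/white mask
    layer = np.array(image)
    layer[..., 3] = binary        # alpha = 255 inside the object, 0 elsewhere

    Image.fromarray(binary).save(mask_path.replace(".png", "_binary.png"))
    Image.fromarray(layer).save(out_path)  # transparent layer containing one object

cut_layer("generated_0000.png", "mask_car.png", "layer_car.png")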
Models:
For each SOTA method, we implement models that are recent, trained on the ImageNet dataset, and described by their authors as benchmark-leading in at least one of the three metrics we prioritize (speed, ability to detect objects successfully, and accuracy of the regions identified).
HIPIE: vit_h_cloud.pth, a pretrained weights file provided at https://github.com/berkeley-hipie/HIPIE
Pre-trained YOLO model.
Python Modules:
Evaluation Pipeline
To this process, we add the evaluation pipeline, which will help us evaluate which state-of-the-art methods are best
for speed, object detection, and accuracy on prompt-driven AI-generated images when given the text of the prompt
to search for. Our evaluation pipeline is a set of high-speed measurements chosen with reference to our initial problem
statement. Artists traditionally create layered documents, and AI creates flat single-layer documents; significant time
and labor would be required to accurately cut out all the pieces of an AI-generated image to prepare it for a layered
workflow. We are evaluating:
Whether any of the objects supposed to be in the image return a blank mask due to the object not being
detected.
Whether the result of creating the layers for the document produced an image that is visually similar to the
original flat image generated by the AI.
The Structural Similarity Index (SSIM) score between the original and the layered image.
The Structural Similarity Index (SSIM) is a metric used to measure the similarity between two images to assess the
overall quality. It is used in evaluating compression algorithms, transmission formats, and generally, methods where
a distortion-free image can be compared to an image that has undergone a distortion. This method is provided in the
scikit-image library, offering a valuable measurement of image quality. SSIM produces a score based on three
separate factors, providing more insight about the difference between images than simply producing an image
difference or mean squared error from a pixel-wise comparison. It considers luminance, contrast, and structure by
evaluating brightness and contrast differences between the images, assigning scores to patches of differences
throughout the image, and then finding the mean score of all the patches of difference.
SSIM(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}
Where:
μx, μy = mean pixel intensities of images x and y
σx², σy² = variances of images x and y
σxy = covariance of images x and y
C1 and C2 are “constants to stabilize the division with weak denominators,” defined as C1 = (k1·L)² and C2 = (k2·L)², where L is the dynamic range of the pixel values and, by default, k1 = 0.01 and k2 = 0.03.
For each image we evaluate, our pipeline will accept the original image and the reconstruction of the image by
assembling layers, and compute the SSIM to assess how closely the reconstructed image matches the original.
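A minimal sketch of this comparison using scikit-image; the file names are placeholders, and the images are assumed to be RGB or RGBA PNGs of identical size:

# Sketch: compute SSIM between the original flat image and the image rebuilt
# by stacking the extracted layers.
from skimage import io
from skimage.color import rgb2gray
from skimage.metrics import structural_similarity

original = rgb2gray(io.imread("generated_0000.png")[..., :3])
rebuilt = rgb2gray(io.imread("reconstructed_0000.png")[..., :3])

score = structural_similarity(original, rebuilt, data_range=1.0)
print(f"SSIM: {score:.4f}")   # 1.0 means the reconstruction is identical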
Edge Match Ratio
This method evaluates the accuracy of image segmentation masks by comparing their edges to actual edges in the
source image. It's tolerant of small misalignments (allowing 1-pixel deviation) and produces a ratio between 0 and 1
indicating how well the mask edges align with actual image features. The scoring is particularly useful for
evaluating the quality of automated image segmentation or masking tools, where perfect pixel-perfect alignment
might not be necessary but general accuracy is important.
The computation proceeds as follows: the source image undergoes Canny edge detection, followed by a Gaussian blur (with blur_factor = 0.75); for the mask image, we extract contours and draw them as single-pixel edges.
Canny edge detection is particularly well-suited for this mask evaluation task for several reasons:
1. Dual thresholds: Canny uses two thresholds (100 and 200 in our implementation) to identify strong and weak edges. This reduces noise while preserving important edge continuity, making the comparison with mask edges more reliable.
2. Pre-processing: Canny applies Gaussian smoothing before computing gradients, which suppresses spurious edges from noise in the source image.
3. Thin edges: Canny produces single-pixel-width edges, which work well with the single-pixel contour perimeters evaluated in our masks and make pixel-to-pixel comparison more meaningful, since both sets of edges are similarly thin.
This combination of features makes Canny more reliable than simpler approaches like Sobel or simple thresholding
would be. The clean, thin edges it produces are ideal for comparing against mask boundaries, while its noise
resistance helps ensure the comparison focuses on meaningful image features rather than artifacts.
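A minimal sketch of this edge-match computation with OpenCV; treating blur_factor as the Gaussian sigma and implementing the 1-pixel tolerance via dilation are assumptions about one reasonable realization, not a verbatim copy of our evaluation code:

# Sketch: edge match ratio between a binary mask and the source image.
import cv2
import numpy as np

def edge_match_ratio(image_path: str, mask_path: str, blur_factor: float = 0.75) -> float:
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    mask = cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE)

    # Edges of the source image: Canny with dual thresholds, then a light blur.
    image_edges = cv2.Canny(gray, 100, 200)
    image_edges = cv2.GaussianBlur(image_edges, (3, 3), blur_factor)

    # Edges of the mask: single-pixel contour perimeters.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    mask_edges = np.zeros_like(mask)
    cv2.drawContours(mask_edges, contours, -1, 255, thickness=1)

    # Tolerate a 1-pixel deviation by dilating the image edges before comparing.
    tolerant = cv2.dilate(image_edges, np.ones((3, 3), np.uint8))
    mask_edge_pixels = mask_edges > 0
    if mask_edge_pixels.sum() == 0:
        return 0.0                       # blank mask: nothing to match
    matched = (tolerant[mask_edge_pixels] > 0).sum()
    return float(matched) / float(mask_edge_pixels.sum())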
We have created a set of standardized prompts in a dataset that we will use to generate images for testing purposes.
Some example prompts are as follows:
prompt1:: tall building, large window, city, giraffe's head, human arm, sparkling necklace, shadowy figure, dark
hair::A tall building with a large window overlooking a city, a giraffe's head peeking out from behind the window, a
human arm reaching out to touch the giraffe, a sparkling necklace hanging from the giraffe's neck, and a shadowy
figure with dark hair watching from the background.
prompt2:: bed, building, teddy bear::A cozy bed nestled in the corner of a grand, towering building, with a soft,
brown teddy bear resting on top.
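A minimal sketch of how such records can be split into a name, a noun list, and the full prompt; the "::" delimiter follows the examples above, while the one-record-per-line file layout is an assumption:

# Sketch: parse "name:: noun1, noun2, ...::full prompt text" records.
def parse_prompt_record(line: str) -> tuple[str, list[str], str]:
    name, nouns, prompt = (part.strip() for part in line.split("::", 2))
    return name, [n.strip() for n in nouns.split(",")], prompt

with open("prompts.txt", encoding="utf-8") as f:
    records = [parse_prompt_record(line) for line in f if line.strip()]

name, nouns, prompt = records[0]
# name   -> "prompt1"
# nouns  -> ["tall building", "large window", "city", ...]
# prompt -> "A tall building with a large window overlooking a city, ..."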
The full evaluation was run for 100 prompts on 3 batches of images (300 samples to test).
At this point, we will have a database in which every original image has been processed by each of our state-of-the-art methods, and each method is scored on time, prompt adherence, structural similarity, and edge accuracy. From this, we
hope to identify the benefits and disadvantages of each method, recommend future work to improve these methods,
and determine whether one method or a combination of methods selected based on the prompt has performed best.
By evaluating these methods, we aim to identify which segmentation techniques offer the best performance for our
specific application of layering AI-generated images based on text prompts. We also consider the missing item ratio,
but this factor can be mitigated in the future by using our findings in this project to train a more comprehensive
model on a greater number of objects and masks.
Metric | CLIPSeg | Detectron2 | HIPIE | YOLO | ODISE
Time (seconds) | Mean: 1.9188, Std: 0.0560 | Mean: 20.3269, Std: 0.5493 | N/A | 492 detections in 12.04 seconds; 527 detections in 12.53 seconds; 435 detections in 12.44 seconds | Mean: 241.74, Std: 0.27
SSIM | Mean: 0.7068, Std: 0.1341 | Mean: 0.3779, Std: 0.1427 | N/A | |
Average Edge Match Ratio | Mean: 0.4660, Std: 0.1326 | Mean: 0.1096, Std: 0.1072 | N/A | |
Missing Item Ratio | Mean: 0.1739, Std: 0.1454 | Mean: 0.8382, Std: 0.1503 | N/A | |
We encountered technical issues with the HIPIE implementation, where the method required custom training using
an outdated version of Nvidia’s Cuda Toolkit. The implementation from the published paper included a clone of
Detectron, used for initial training of the backbone piece of the method, and then relied upon this custom training
method to achieve the hierarchical “things versus stuff” separation that HIPIE is supposed to be able to perform. Our
initial investigation into this method seemed to show that sample model weights had been supplied, and it was not
until many weeks of work that we realized these weights weren’t working properly because of the requirement for
the intermediate training step. The hardware recommendation for this step greatly exceeded what we had available,
and so we were left with a classifier that is essentially no different from Detectron2, pre-trained on a 90-class ResNet.
Head-To-Head Assessment
Although all of the methods we investigated have benchmarks comparing their performance on Imagenet, we found
that many of the pretrained weights available were for much smaller data sets or had other limitations like the
training required for HIPIE mentioned above.
While our assessment bears in mind that we are not looking for the highest absolute score but rather the relative scores of these methods, we realized partway through the project that we also needed to deal with the fact that some methods miss significantly more objects than others, to an extent that would bias their performance. To that end, we updated our reporting to include per-layer edge match ratio scores, and we wrote a custom assessment program that runs off the evaluation reports head-to-head, comparing our accuracy metric only for objects that were successfully detected by both methods being compared. This gives us a better perspective on how well these methods work relative to one another, irrespective of whether they have been trained to recognize a specific object.
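A minimal sketch of that head-to-head comparison logic; representing each evaluation report as a dictionary of per-object edge-match scores keyed by (image, noun) is an assumption about how the reports could be structured, not the exact format of our files:

# Sketch: compare two methods' per-layer edge match scores only on objects
# that both methods actually detected.
def head_to_head(report_a: dict, report_b: dict) -> dict:
    wins_a = wins_b = 0
    common_a, common_b = [], []
    for key, score_a in report_a.items():      # key = (image_id, noun)
        score_b = report_b.get(key)
        if score_b is None:                    # detected by only one method
            continue
        common_a.append(score_a)
        common_b.append(score_b)
        if score_a > score_b:
            wins_a += 1
        elif score_b > score_a:
            wins_b += 1
    n = max(len(common_a), 1)
    return {"wins_a": wins_a, "wins_b": wins_b,
            "mean_a": sum(common_a) / n, "mean_b": sum(common_b) / n}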
Metric | CLIPSeg | Detectron
Total wins per method | 897 wins | 3 wins
Processing time (seconds, lower is better) | 1.9188 ± 0.0560 | 20.3269 ± 0.5502
SSIM score (closer to 1 is better) | 0.7068 ± 0.1343 | 0.3779 ± 0.1429
Avg edge match (closer to 1 is better) | 0.4660 ± 0.1328 | 0.1096 ± 0.1074
Wins in head-to-head comparisons | 132 wins | 68 wins
Average scores for items in common | 0.4385 ± 0.2383 | 0.3035 ± 0.2753
We were not able to get every kind of metric and every kind of head-to-head score for all five of our methods against one another because of some unexpected problems with computing resources that cropped up over the course of the semester. The ODISE method relies on several pieces of Detectron2, and specifically on a modified version of Detectron2 that requires an older version of PyTorch (1.7). The distributable release of this version, with its setup and dependencies, is no longer offered by Facebook, and so significant effort went into getting it running locally, but the group member investigating this method could not run it on his MacBook with an Apple silicon processor. This did not entirely stop him; he got the whole pipeline running in a Colab notebook with the help of the WPI helpdesk, but it took about 20 minutes to start a session and get all of the dependencies running every time he needed to step away and return to the code, and this became a serious problem for getting the code to run reliably in our evaluations. Instead, this team member manually ran the evaluation functions for a large batch of images and produced the calculated metrics, but not the head-to-head run-off scoring. Similarly, the group member working on Poly-YOLO, a segmentation variant of YOLO, discovered that it depends on weights for YOLO version 3, while YOLOv5 weights are the only ones we could find commonly distributed for that method. The model weights commonly available for v5 are stored in Keras format, but we needed the Darknet format, which appears to be an older tensor-storage format. To get around this, the group member investigating YOLO built an updated segmentation method using version 5 with a custom cropping function added to the detection step, but the results were rough cutouts made of rectangles, the closest approximation he could get without a supercomputer cluster available to retrain Poly-YOLO.
Ultimately, we determined that ODISE was almost incomprehensibly slow and that YOLO was the fastest method but essentially incapable of segmentation, with no significant way to improve it, because the YOLOv5 method does not represent its understanding of an object in a picture as a per-pixel prediction, requiring us to use a kind of Riemann-sum box method. Focusing on the two methods that fully worked in the way we had hoped for this project, detecting objects in pictures and producing cutouts of them, the CLIPSeg method was easily our pick, with the recommendations we will cover in the Future Work section about how we would like to see it improved in a future version of this task.
Perhaps the most significant lesson learned in evaluating these methods is the picture they paint about computational
resources. From our literature review, we felt that we would probably be trading between speed and accuracy in
these methods, and that has certainly held true. We see, for example, that CLIPSeg is consistently good and fast, but with an edge accuracy of about 40% its masks are not particularly usable for layered image editing. Detectron showed tremendous gains when we factored out objects it could not find, but its mask accuracy still fell short of CLIPSeg’s. This suggests that if we could train Detectron on a much larger dataset, its accuracy could become the best of our methods; however, we also saw that Detectron was slow, averaging 20 seconds to find all of the objects in an image and cut them out.
A final lesson learned, which might be a bit humorous but was certainly a major headache for our entire team, was to beware of Detectron. Each of the methods that relied on it cost significant time and difficulty for the group member involved. Three of our group members needed some form of Detectron, or the various tools it provides, as part of their method, and the most successful of the three got its dependencies to install only after 6 hours of continuous work. The other two group members worked nightly for over two weeks, chipping away at bizarre errors and dependency conflicts, before they got something that worked reliably. Trying to get all of these to work within one environment so that we could run the evaluation pipeline turned out to be technically impossible. After long experience, it seems that many versions of Detectron are built with extremely narrow requirements, often down to a single version of PyTorch or a single type of processing architecture, and many variants are no longer distributed. Some of the ones we encountered required us to downgrade NumPy to a significantly earlier version incompatible with most of the layer fusion functionality. Detectron is a very powerful platform, as is evident from how many unique and interesting methods hook its various tools as essential parts of their experiments, but it quickly became apparent that it is best suited for single-purpose applications, where one specific install is designed for one specific platform with one specific task to perform. Attempting to deploy it flexibly, to locate objects in photos, to perform segmentation, and as a step in various other segmentation pipelines with their own custom networks, proved to be too many different things to ask of it at once. We did not know this would be the case back in the second week of the term when we were choosing our methods, so this was a significant lesson learned about requirements and things to look out for in future projects that combine multiple state-of-the-art methods.
Future Work
Based on our experiments with CLIPSeg, we propose several promising directions for future research to enhance the
model's segmentation capabilities. First, we suggest making fundamental improvements to the decoder architecture.
By incorporating additional transformer layers and multi-scale feature processing, the model could better capture
both fine-grained details and broader contextual information. The current simple linear projection must be replaced
with a more sophisticated upsampling architecture to generate more precise segmentation boundaries. We make this
suggestion on the basis of an example provided by the authors of CLIPSeg after publication, in which they
significantly improved per-pixel accuracy by exploring different connections in this part of the architecture.
Training methodology also presents significant opportunities for advancement. We propose exploring a broader
range of data augmentation techniques to improve the model's robustness and generalization capabilities.
Additionally, expanding the training dataset to include more diverse scenarios and object types could enhance the
model's real-world applicability. The introduction of specialized loss functions that specifically target boundary
accuracy could lead to more precise segmentation results, particularly in challenging cases with complex object
boundaries. We base this idea on the relatively limited PhraseCut dataset; while it contains 360,000 language/mask
pairs, the masks in the dataset are not always very precise. By improving the dataset (possibly recruiting professionals to assist with ground-truth masks), we feel the training could improve.
Finally, we propose several architectural enhancements that could significantly improve performance. Exploring
additional skip connections from earlier CLIP layers could help preserve fine-grained spatial information throughout
the network. The addition of multi-resolution processing paths or more pooling layers could enable the model to
better handle objects at varying scales. Working with the relatively older CLIP required us to scale images to
352x352 for our initial layers. Though the CLIPSeg method works at larger resolutions, we suspect that nuance is lost with this small input-size requirement.
These proposed adjustments could collectively enhance CLIPSeg's ability to generate more precise and reliable
segmentation masks while maintaining its flexibility in handling both text and image prompts. Future research could
easily be plugged into our evaluation pipeline and toolset to track improvements.
Background research:
SDXL (Image generator):
Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., & Rombach, R. (2023). SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv. https://arxiv.org/abs/2307.01952
Lüddecke, T., & Ecker, A. S. (2022). Image segmentation using text and image prompts. arXiv. https://arxiv.org/abs/2112.10003
Xu, J., et al. (2023). Open-vocabulary panoptic segmentation with text-to-image diffusion models. arXiv. https://arxiv.org/abs/2303.04803
Poly-YOLO. (n.d.). Poly-YOLO: Higher speed, more precise detection, and instance segmentation for YOLOv3. SpringerLink. https://link.springer.com/article/10.1007/s00521-021-05978-9
Song, Q., Li, S., Bai, Q., Yang, J., Zhang, X., Li, Z., & Duan, Z. (2021). Object detection method for grasping robot based on improved YOLOv5. Micromachines, 12(11), 1273. https://doi.org/10.3390/mi12111273
Xu, J., Han, Z., Xu, H., Zhang, C., Huang, K., & Bai, X. (2023). Hierarchical open-vocabulary universal image segmentation. Papers with Code. https://paperswithcode.com/paper/hierarchical-open-vocabulary-universal-image-1
Abhishek, A. V. S. (2021). Detectron2 object detection & manipulating images using cartoonization. IJERT. https://www.ijert.org/research/detectron2-object-detection-manipulating-images-using-cartoonization-IJERTV10IS080122.pdf
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin Transformer: Hierarchical vision transformer using shifted windows. IEEE. https://ieeexplore.ieee.org/document/9710580
Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning transferable visual models from natural language supervision. arXiv. https://arxiv.org/abs/2103.00020
Ranftl, R., Lasinger, K., Hafner, D., et al. (2020). MiDaS: Mixed dataset training for monocular depth estimation. arXiv. https://arxiv.org/abs/1907.01341
Lugmayr, A., Danelljan, M., Romero, A., et al. (2022). Denoising diffusion probabilistic models for image inpainting. arXiv. https://arxiv.org/abs/2201.09865
Nguyen, T. T. H., Clement, T., Nguyen, P. T. L., Kemmerzell, N., Truong, V. B., Nguyen, V. T. K., Abdelaal, M., & Cao, H. (2024). LangXAI: Integrating large vision models for generating textual explanations to enhance explainability in visual perception tasks. GitHub. https://github.com/hungntt/LangXAI
References:
6sense. (n.d.). Companies using Adobe Photoshop. https://6sense.com/tech/graphic-design-software/adobe-photoshop-market-share
Adobe Inc. (n.d.). Layer basics. Adobe Photoshop user guide. https://helpx.adobe.com/photoshop/using/layer-basics.html
Boesch, G. (2024, February 11). Detectron2: A rundown of Meta's computer vision framework. viso.ai. https://viso.ai/deep-learning/detectron2
Kamali, N., Nakamura, K., Chatzimparmpas, A., Hullman, J., & Groh, M. (2024). How to distinguish AI-generated images from authentic photographs. arXiv. https://arxiv.org/abs/2406.08651
Kerr, M. P. M. (2022). Developing Fluid: Precision, Vagueness and Gustave Le Gray's Photographic Beachscapes. In Coastal Cultures of the Long Nineteenth Century (pp. 200–224). Edinburgh University Press. https://doi.org/10.1515/9781474435758-015
Verified Market Research. (2024, July). Digital painting market size and forecast. https://www.verifiedmarketresearch.com/product/digital-painting-market
Wang, J. Y. A., & Adelson, E. H. (1994). Representing moving images with layers. IEEE Transactions on Image Processing, 3(5). https://ieeexplore.ieee.org/document/334981
Wikipedia. (n.d.). Traditional animation. https://en.wikipedia.org/wiki/Traditional_animation