GeoChat: Grounded Large Vision-Language Model for Remote Sensing

Kartik Kuckreja(1,2)*, Muhammad Sohail Danish(1)*, Muzammal Naseer(1), Abhijit Das(2), Salman Khan(1,3), Fahad Shahbaz Khan(1,4)

(1) Mohamed bin Zayed University of AI, (2) Birla Institute of Technology & Science, Hyderabad, (3) Australian National University, (4) Linköping University

* Equally contributing first authors.

arXiv:2311.15826v1 [cs.CV] 24 Nov 2023

[email protected], [email protected]

Abstract

Recent advancements in Large Vision-Language Models (VLMs) have shown great promise in natural image domains, allowing users to hold a dialogue about given visual content. However, such general-domain VLMs perform poorly for Remote Sensing (RS) scenarios, leading to inaccurate or fabricated information when presented with RS domain-specific queries. Such a behavior emerges due to the unique challenges introduced by RS imagery. For example, to handle high-resolution RS imagery with diverse scale changes across categories and many small objects, region-level reasoning is necessary alongside holistic scene interpretation. Furthermore, the lack of domain-specific multimodal instruction following data as well as strong backbone models for RS makes it hard for the models to align their behavior with user queries. To address these limitations, we propose GeoChat - the first versatile remote sensing VLM that offers multitask conversational capabilities with high-resolution RS images. Specifically, GeoChat can not only answer image-level queries but also accepts region inputs to hold region-specific dialogue. Furthermore, it can visually ground objects in its responses by referring to their spatial coordinates. To address the lack of domain-specific datasets, we generate a novel RS multimodal instruction-following dataset by extending image-text pairs from existing diverse RS datasets. We establish a comprehensive benchmark for RS multitask conversations and compare with a number of baseline methods. GeoChat demonstrates robust zero-shot performance on various RS tasks, e.g., image and region captioning, visual question answering, scene classification, visually grounded conversations and referring detection. Our code is available here.

[Figure 1: example GeoChat interactions on a remote sensing image - grounded image captioning, visual question answering, scene classification, referring expression, region-based captioning, and multi-turn conversation.]

Figure 1. GeoChat can accomplish multiple tasks for remote-sensing (RS) image comprehension in a unified framework. Given suitable task tokens and user queries, the model can generate visually grounded responses (text with corresponding object locations - shown on top), visual question answering on images and regions (top left and bottom right, respectively) as well as scene classification (top right) and normal natural language conversations (bottom). This makes it the first RS VLM with grounding capability.

1. Introduction

In the natural image domain, the abundance of aligned image-text data sourced from web imagery or manual annotations facilitates effective self-supervised vision-language modeling, as demonstrated by multimodal GPT-4 [23] and open-source initiatives like LLaVA [19]. These vision-language models (VLMs), developed through generative pretraining and instruction-tuning, exhibit robust zero-shot task completion across various user-oriented multimodal tasks. The resulting capabilities open the door to the development of versatile multimodal conversational assistants with broad applications in real-world scenarios [12].
However, general-domain VLMs designed for natural images exhibit poor performance when presented with remotely sensed visual imagery. The performance disparity arises primarily from the distinct nature of content found in remote sensing image-text pairings compared to the publicly available web data. As a result, general-domain VLMs can provide inaccurate information or hallucinate when presented with spatial images from RS sensors. Although there has been significant progress in the field of remote sensing visual question answering (VQA) [38, 40], earlier methods have framed the task as a classification problem, where the model chooses answers from predetermined responses found in the training data. This limits their applicability to open-ended answer generation and instruction-following.

In this paper, we introduce GeoChat, an attempt to extend multimodal instruction-tuning to the remote sensing domain for training a multitask conversational assistant. However, the remote sensing domain lacks a multimodal instruction-tuning conversational dataset. Inspired by recent work in instruction-tuning [14, 19, 41], GeoChat uses Vicuna-v1.5 [7] and an automated pipeline to generate diverse remote sensing multimodal instruction-following data comprising nearly 318k instructions. We create the image-text pairs from various existing remote sensing datasets developed for diverse tasks. These include LRBEN for VQA [20], NWPU-RESISC-45 for scene classification [5] and SAMRS for object detection [30].

A crucial capability of GeoChat is the unification of multiple image and region-level reasoning tasks for RS imagery within a single pipeline (see Fig. 1). We achieve this via distinct task tokens that help suitably direct the model's responses according to user requirements. In addition, the model uses spatial location representations in its inputs to seamlessly reason about local regions and can also generate object locations in its responses to visually ground objects. This makes a diverse set of tasks possible with GeoChat, including referring expression detection, image/region captioning, scene classification, natural language conversations and VQA, besides visually grounded conversations.

In summary, this work has the following contributions:
• RS multimodal instruction following dataset. We present a novel data generation pipeline that leverages an existing object detection dataset [30] to create short descriptions of the images, followed by using Vicuna-v1.5 [7] to create conversations from the generated text alone. Further, we add visual question-answering and scene classification abilities using their corresponding datasets [5, 20]. This results in a total of 318k instruction pairs for the RS domain.
• GeoChat. Leveraging our dataset, we finetune LLaVA-1.5 [14] to create the remote sensing-domain vision-language model GeoChat. Our LoRA [11] fine-tuning is efficient and avoids forgetting the necessary context embedded in the fully-tuned LLaVA model, whose MLP projection is trained to align images into the word embedding space of the LLM (Vicuna-v1.5 [7]). This allows GeoChat to retain the conversation and instruction following abilities of LLaVA and extend its domain knowledge to remote sensing tasks.
• We also address the lack of evaluation benchmarks to assess the capability of existing VLMs on remote-sensing conversations. To this end, we set up evaluation protocols for conversation grounding in RS, as well as a suite of tasks to allow comparisons with future efforts in this direction. We show various supervised as well as zero-shot evaluations for different remote sensing tasks, including image captioning, visual question answering and scene classification, to demonstrate the generalisability of the GeoChat conversational VLM.

2. Related Work

Large Vision-Language Models. The typical architecture of instruction-following Vision Language Models (VLMs) consists of a pre-trained visual backbone [9] to encode visual data, a large language model [7] for interpreting user instructions and generating responses, and a vision-language cross-modal connector, e.g., a linear projection layer [18, 41] or an MLP [17], for fusing visual information with language models. The results achieved with VLMs show great promise; for example, LLaVA [18], InstructBLIP [8], Otter [13] and MiniGPT-4 [41] show remarkable gains in language instruction following and visual reasoning ability for natural scenes. More recent studies have shown that these models can be adapted to other domains such as videos [22], biomedical imaging [14, 29] and remote sensing [12].

Remote Sensing VLMs. The application of generalized VLMs in remote sensing is comparatively sparse. The majority of research so far has neglected the semantic understanding of the objects in a scene and their relationships, which is required for deep visual comprehension. Beyond merely identifying the objects in an image, vision-language models are also capable of generating natural language descriptions of the image and inferring the connections between the objects. This makes them more appropriate for tasks like text-based image retrieval, captioning images, and answering visual questions that call for both visual and linguistic knowledge. Although there has been progress in vision language models for remote sensing tasks, such as image captioning [42], zero-shot classification [16] and visual question answering [3, 38], these models can only perform the specific task they are trained for, lack conversational capability and do not possess generic semantic knowledge about remote sensing images. A major gap exists in the remote sensing domain towards developing general-purpose models that solve all tasks together, while also maintaining conversation abilities.
While RSGPT [12] is an initial effort that has shown good conversation ability along with solving multiple tasks, it requires finetuning the model for each task separately, which makes it cumbersome and not generalizable. Further, RSGPT cannot work for region-level reasoning or visual grounding, which our work aims to address.

3. GeoChat: Grounded Remote Sensing VLM

Visually grounded conversations for remote sensing aim to generate textual responses interleaved with corresponding object locations. Further, a user can also provide visual prompts (e.g., a bounding box) besides natural language questions, and the model should be able to answer questions about the specified Region of Interest (RoI). Such seamless interplay between visual and language modalities necessitates a deep comprehension of linguistic constructions that denote particular objects or elements in a visual scene.

As mentioned above, GeoChat is the first model capable of holding visually grounded conversations about remotely sensed images. By construction, GeoChat can address not only the challenging task of visually grounded conversations, but can also perform a spectrum of other spatial reasoning tasks that span varying levels of granularity in visual imagery understanding, e.g., image/region captioning, referring object detection and image/region-level conversations about remotely sensed images. We formally outline the tasks possible with GeoChat below.

a) Image-Level Conversation Tasks. In this task, GeoChat processes an image x and a user text query q without any specific spatial coordinates in its inputs or outputs. The goal is to perform conversation-based tasks at a holistic level with image-wide context, such as visual question answering (VQA), scene classification and image captioning.

b) Region-Level Conversation Tasks. This task involves providing spatial box locations b in the input to GeoChat besides x and q. Region locations b guide the model's attention to specific regions within the image, so that the model can perform tasks such as region-level captioning, region-specific VQA or multi-turn conversation.

c) Grounded Conversation Tasks. With the use of special tokens, termed task-specification tokens t, GeoChat can be guided to provide object locations at different granularities, while maintaining conversation abilities. This helps in tasks including grounded image captioning/conversation, object grounding and referring expression detection.

3.1. GeoChat Architecture

GeoChat follows the architecture of LLaVA-v1.5 [17], which consists of three core components: i) a global image encoder, ii) an MLP adaptor (two linear layers) and iii) an LLM. Different from LLaVA, we add a specific task prompt that indicates the type of task desired from the model, i.e., grounding, image-level or region-level conversations. Additionally, we allow spatial positions within both inputs and outputs, enabling visual prompts as inputs and grounded objects in GeoChat outputs. Notably, the original LLaVA model cannot perform object grounding or accept region inputs. Further, the original LLaVA cannot reason about remote sensing images, which is enabled via our domain-specific dataset. We describe each component in the architecture as follows:

Data | Size | Response formatting prompts
Detailed Description | 30k | Describe the image in detail.
Multi-Round Conversation | 65k | -
Complex Questions | 10k | -
RSVQA-LRBEN [20] | 56k | Answer the question using a single word or phrase.
NWPU-RESISC-45 [5] | 31.5k | Answer the question using a single word or phrase.
Floodnet [25] | 4k | Answer the question using a single word or phrase.
Grounding Description | 45k | [grounding] Describe the image in detail.
Region Captioning | 40k | [identify] {x_left, y_top, x_right, y_bottom | θ}
Referring Expression | 25k | [refer] <p>Object</p>

Table 1. Instruction following data used to train GeoChat. Instruction types and format are shown. We use a 306k set for training and a separate 12k instruction-set for testing.

Task Token: The unique quality of GeoChat is its ability to easily switch between different types of remote sensing visual interpretation tasks. To eliminate uncertainty among tasks, our approach assigns a unique task identifier to each one. We define three distinct task identifiers, t ∈ {grounding, identify, refer}, for grounded conversations, region captioning and referring expression comprehension, respectively. For visual question answering and scene classification, we directly ask the model to output the answer in a single word or phrase, as shown in Table 1. Our approach does not employ any task identification tokens for vision-irrelevant commands. This unified approach is supported by a modular design that efficiently integrates spatial data, giving the model flexibility in its reasoning about visual content.

Spatial Location Representation. Our model must precisely identify the spatial position of the referenced items for tasks such as grounded conversations, referring expression generation, and comprehension. To this end, we represent the box locations in a textual format to express the geographical position: b = {x_left, y_top, x_right, y_bottom | θ}. Here, x_left, y_top denote the top-left corner point of the box, while x_right, y_bottom represent the bottom-right corner coordinates. The angle θ represents the angle of rotation of the bounding box, measured from the lower edge. Numerical values normalised within the interval [0, 100] are used to represent the x and y coordinates. Region locations in this format are used to interact with the model via its inputs and outputs.
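To make this textual location format concrete, here is a minimal Python sketch of encoding a box given in pixel coordinates into the {x_left, y_top, x_right, y_bottom | θ} string and decoding it back; the exact string layout, rounding and angle convention used by GeoChat are assumptions for illustration, not the released implementation.

```python
# Minimal sketch of the textual box format described above. The exact string
# layout and rounding used by GeoChat are assumptions for illustration only.

def encode_box(x_left, y_top, x_right, y_bottom, theta, img_w, img_h):
    """Scale pixel coordinates to [0, 100] and emit {x_left, y_top, x_right, y_bottom|theta}."""
    xl = round(100 * x_left / img_w)
    yt = round(100 * y_top / img_h)
    xr = round(100 * x_right / img_w)
    yb = round(100 * y_bottom / img_h)
    return f"{{{xl}, {yt}, {xr}, {yb}|{round(theta)}}}"

def decode_box(token, img_w, img_h):
    """Parse the textual token back into pixel coordinates and a rotation angle."""
    coords, theta = token.strip("{}").split("|")
    xl, yt, xr, yb = (float(v) for v in coords.split(","))
    return (xl / 100 * img_w, yt / 100 * img_h,
            xr / 100 * img_w, yb / 100 * img_h, float(theta))

# Example: a box in a 504x504 image -> "{10, 20, 30, 50|45}"
print(encode_box(50, 101, 151, 252, 45, 504, 504))
```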
Visual Backbone. GeoChat adapts the pretrained vision backbone of CLIP-ViT(L-14) [28], which has an input resolution of 336×336. This results in effectively 576 patches per image. Since this resolution is not sufficient to understand the details present in remote sensing imagery (e.g., small objects and object details), we interpolate the positional encoding in the transformer-based CLIP [28] model to scale with an input image size of 504×504. Although this leads to an increase in the number of patches to almost double (i.e., 1296 per image), the enhanced resolution allows us to handle larger image sizes and also supports better visual grounding in high-resolution RS images.
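One common way to realize this resizing, shown below as a sketch, is to reshape the 24×24 grid of CLIP ViT-L/14 patch position embeddings into a 2-D map, resize it to 36×36 for 504×504 inputs, and flatten it back while leaving the class-token embedding untouched. The paper only states that positional encodings are interpolated, so the bicubic mode and this exact procedure are assumptions.

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, new_grid=36, old_grid=24):
    """Resize ViT positional embeddings from old_grid^2 to new_grid^2 patches.

    pos_embed: (1, 1 + old_grid**2, dim) tensor, index 0 is the class token.
    Returns a (1, 1 + new_grid**2, dim) tensor. Bicubic resizing over the
    2-D patch grid is an assumption; GeoChat only states that positional
    encodings are interpolated for 504x504 inputs.
    """
    cls_tok, patch_tok = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_tok.shape[-1]
    grid = patch_tok.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid, new_grid),
                         mode="bicubic", align_corners=False)
    patch_tok = grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_tok, patch_tok], dim=1)

# CLIP ViT-L/14 at 336x336 has 24x24 = 576 patches; 504x504 gives 36x36 = 1296.
pos = torch.randn(1, 1 + 576, 1024)
print(interpolate_pos_embed(pos).shape)  # torch.Size([1, 1297, 1024])
```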
[Figure 2: architecture diagram and example outputs. A frozen CLIP-ViT encoder and a frozen MLP adaptor map image patches into the input space of a LoRA-tuned large language model, which also receives a system message ("A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.") and task-specific prompts; sample outputs are shown for referring expression, region captioning, image description and visual QA.]

Figure 2. An overview of GeoChat - the first grounded large vision-language model for remote sensing. Given an image input together with a user query, a visual backbone is first used to encode patch-level tokens at a higher resolution via interpolating positional encodings. A multi-layer perceptron (MLP) is used to adapt vision-tokens to language space suitable for input to a Large Language Model (Vicuna 1.5). Besides visual inputs, region locations can also be input to the model together with task-specific prompts that specify the desired task required by the user. Given this context, the LLM can generate natural language responses interleaved with corresponding object locations. GeoChat can perform multiple tasks as shown on top, e.g., scene classification, image/region captioning, VQA and grounded conversations.

[USER] <im_start> Image Features <im_end> [Task Identifier] [ASSISTANT]

Figure 3. Multi-task instruction template for GeoChat.
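To make the template in Figure 3 concrete, the sketch below assembles a prompt in that layout. The system message is taken from Figure 2; the exact spelling of the special tokens, the position of the user query, and the "<image>" placeholder standing in for the projected image features are illustrative assumptions rather than GeoChat's actual preprocessing code.

```python
# Illustrative assembly of the multi-task template in Figure 3. The system
# message follows Figure 2; "<image>" standing in for the projected image
# features is an assumption for readability.

SYSTEM = ("A chat between a curious human and an artificial intelligence "
          "assistant. The assistant gives helpful, detailed, and polite "
          "answers to the human's questions.")

def build_prompt(user_query, task_token=None):
    """Compose: USER <im_start> image features <im_end> [task identifier] query ASSISTANT."""
    task = f"[{task_token}] " if task_token else ""
    return (f"{SYSTEM}\n"
            f"USER: <im_start><image><im_end> {task}{user_query}\n"
            f"ASSISTANT:")

# Grounded description vs. plain scene classification:
print(build_prompt("Describe the image in detail.", task_token="grounding"))
print(build_prompt("Is this a rural or an urban area? Answer in one word."))
```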
MLP Cross-modal Adaptor. From the frozen CLIP-ViT [28], we project the output tokens (∈ R^(1296×1024)) of dimension 1024 onto the language model space using an MLP adaptor with one hidden layer. The adaptor has an input dimensionality of 1024 and outputs a vector of size 4096, corresponding to the input size of the LLM [7]. GeLU [10] is used as the activation function.
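A minimal PyTorch sketch of this cross-modal projector is given below; the text specifies the 1024-dimensional input, the 4096-dimensional output and the GeLU activation, while setting the hidden width to 4096 (as in LLaVA-1.5's two-layer projector) is our assumption.

```python
import torch
import torch.nn as nn

class CrossModalAdaptor(nn.Module):
    """Two-layer MLP projector: 1024-dim CLIP patch tokens -> 4096-dim LLM space.

    One hidden layer with GELU, as described in Sec. 3.1; using 4096 for the
    hidden width (as in LLaVA-1.5's mlp2x_gelu projector) is an assumption.
    """

    def __init__(self, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vis_tokens):           # (batch, 1296, 1024)
        return self.proj(vis_tokens)         # (batch, 1296, 4096)

tokens = torch.randn(2, 1296, 1024)          # 36x36 patches from the 504x504 input
print(CrossModalAdaptor()(tokens).shape)     # torch.Size([2, 1296, 4096])
```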
Large Language Model. The open-source Vicuna-v1.5(7B) [7] large language model is utilised as the foundation for GeoChat. The language model functions as a single interface for diverse vision-language inputs in our framework. To accomplish different vision-language tasks, we directly depend on the Vicuna-v1.5(7B) [7] language tokens. We explicitly interact with the language model to construct textual representations of bounding boxes, expressing their spatial coordinates for the visual grounding tasks that require the production of spatial locations. Similarly, the safe, aligned and effective behavior of the LLM is ensured via system prompts appended to the given inputs. A Low-Rank Adaptation (LoRA) [11] based strategy is used for fine-tuning the LLM. While training, instead of finetuning all of the weights that comprise the weight matrix of the pre-trained Vicuna-v1.5 [7], we finetune two smaller matrices in LoRA [11] that approximate the original larger matrix. The fine-tuned adaptors are then plugged into the pre-trained model and utilised for inference. The LoRA adaptation ensures faster training and avoids forgetting the original knowledge embedded in the LLM trained and fine-tuned on generic natural language instructions. This is an important feature since it allows the model to bring in external context about generic object types, landmarks and affordances into the remote-sensing reasoning framework of GeoChat.
3.2. Training Details

To enhance the effectiveness of our model on general visual tasks and optimize training efficiency, we employ a strategy that involves initializing the network with pre-trained weights and fine-tuning specific segments for remote sensing related tasks. We use a pre-trained CLIP-ViT(L-14) encoder [28], trained on large amounts of textual and visual data, a pretrained MLP adaptor [17], pretrained on a 558K subset of the LAION-CC-SBU [26] dataset with BLIP [15] captions, and Vicuna-v1.5 [7] to initialize our model. To adapt our model to remote sensing images, we subsequently LoRA [11] fine-tune the LLM, while keeping the MLP adaptor and the CLIP encoder [28] frozen during training.

[Figure 4: for an example harbor scene, the extracted object attributes and relationships (e.g., ship {white, small, top-left}), referring expressions and region captions (top), and the resulting multi-turn conversation, detailed description, complex reasoning and grounded description instructions produced by the LLM (bottom).]

Figure 4. Types of annotations available in the GeoChat instruction-set. Top-row: For a given RS image, we obtain object attribute and relationship information, referring expressions and region captions along with their corresponding region annotations (shown over the image). Bottom-row: This structured information is used to create the rich instruction-set with a total of 318k image-instruction pairs.

4. RS Multimodal Instruction Dataset

Using the Vicuna LLM [7], we align the model to follow a range of instructions by presenting and curating varied instruction-following data with multi-round conversations regarding remote sensing imagery (Table 1). We specifically provide system instructions as prompts that ask Vicuna [7] to generate multi-round question and answer pairs as if it could visualize the image (although it only has access to the text). This is achieved by providing few-shot in-context examples manually composed within the prompt to show Vicuna [7] how to build high-quality instruction-response pairs based on the caption and information supplied. Specifically, from our short descriptions created using the pipeline below, we randomly sample 65k images to create multi-round conversations, 10k images to generate complex question answers and 30k images to generate detailed descriptions for the given short descriptions. In combination, after conversion to instruction format, we obtain a total of nearly 306k image-instruction pairs for training and 12k for testing. Next, we outline the instruction-set creation process.
Dataset | Category | # Classes | # Images | Image Size
DOTA | Object Detection | 18 | 17,480 | 1024 × 1024
DIOR | Object Detection | 20 | 23,463 | 800 × 800
FAIR1M | Object Detection | 37 | 64,147 | 600 × 600
LRBEN (RSVQA) | Visual Question Answering | - | 600 | 256 × 256
Floodnet | Visual Question Answering | - | 4056 | 3000 × 4000
NWPU-RESISC-45 | Scene Classification | 45 | 31,500 | 256 × 256

Table 2. List of datasets used to create our remote-sensing instruction set for GeoChat VLM training. We include object detection, visual question answering and scene classification datasets with varying image sizes and types of classes to ensure diversity.

Constituent Datasets: In the compilation of our instruction set, we incorporate three distinct types of datasets, encompassing the ones designed for object detection, scene classification, and visual question answering (VQA). Specifically, we integrate three object detection datasets (DOTA [34], DIOR [6], and FAIR1M [27], which together form the SAMRS [30] dataset), one scene classification dataset (NWPU-RESISC-45 [5]), one VQA dataset (LRBEN [20]), and one flood detection [25] VQA dataset (see Table 2). The object detection datasets allow region-level reasoning capability as they offer segmentation masks along with bounding boxes.
[Figure 5: qualitative examples showing (left) a grounded detailed description of baseball fields near buildings, (center) referring object detection of a Boeing 747 at the bottom left, and (right) flood damage question answering about flooded roads and buildings.]

Figure 5. Qualitative results of GeoChat. (left-right) Results are shown on grounding, referring object detection, and disaster/damage detection. The user can provide task-specific tokens (e.g., [grounding]) to shape model responses according to the desired behavior. The model can generate textual responses (right), only visual grounding (center) and both text and object groundings interleaved together (left). The model can also specify object types, object counts, object attributes and object relationships.

Addition of Missing Classes: Although a wide variety of object classes are included in the object detection databases, several essential categories like buildings, roads, and trees are missing. To address this, we propose to utilize the ViTAE-RVSA [31] model, pre-trained on the LoveDA dataset [32], which encompasses the required important classes. The model [31] is used to infer these classes on the SAMRS [30] dataset, yielding pseudo labels. To mitigate potential noise in these predictions, we remove the predictions of ViTAE-RVSA [31] for which we already have ground truth from the SAMRS [30] dataset to refine the results.

Attribute | Example
a1 | category (e.g. "plane, ship")
a2 | color (e.g. "gray, white")
a3 | relative size (e.g. "small, large")
a4 | relative location (e.g. "top right, bottom")
a5 | relation (e.g. "parked at, driving through")

Table 3. List of attributes collected for objects. Attributes are used to obtain referring expressions, e.g., small-sized plane to the left.

Categories | Example
Ships and Harbors | e.g. "anchored at, parked at"
Track Field and Soccer Field | e.g. "Surrounded by, Inside"
Vehicles, Bridge, Road, Roundabout | e.g. "passing through, passing through"
Vehicles and Building | e.g. "parked"
Airport and Plane | e.g. "parked"
Ship and Helipad | e.g. "on, contains"

Table 4. Examples of relationships between different objects used in the proposed instruction dataset.

Attribute extraction: For referring expression annotations, it is important to derive a variety of attributes in RS images. To this end, we have selected five distinct types of attributes, as outlined in Table 3. Object category information can be directly obtained from the SAMRS dataset. For color extraction, we use the K-Means clustering algorithm. Specifically, we extract the object's pixels from the image using the ground-truth box and cluster them into K groups. The center of the largest cluster is then selected as the object's color. To specify the relative size of the object, we categorize objects into three sizes: small, normal, and large. This categorization is determined by measuring the area of all instances of a class in the entire dataset and assigning the 80th percentile as the large label. Similarly, the 20th percentile is designated as small, with the remaining falling into the normal category. To determine the object's relative position within the image, we partition the entire image into a 3×3 grid, defining regions such as Top Right, Top, Top Left, Left, Center, Right, Bottom Right, Bottom Left, and Bottom. Based on the object's center pixel coordinates, we assign its relative position accordingly.

To define the relation between objects in a given image, we group different objects based on the distance between their bounding boxes, and for each sub-graph, we assign different relationships between objects based on their class labels. Table 4 presents various examples of object relationships. To establish relationships like "surrounded by," we cross-reference pixel-level coordinates to verify if one object is entirely contained within another object.
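The sketch below illustrates these three attribute-extraction steps (dominant color via K-Means over the pixels inside the ground-truth box, relative size from dataset-wide area percentiles, and relative position from a 3×3 grid); the value of K, the use of scikit-learn, and the exact tie-breaking are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def dominant_color(image, box, k=5):
    """K-Means over the pixels inside the ground-truth box; return the largest cluster center.

    image: HxWx3 array, box: (x1, y1, x2, y2) in pixels. k=5 is an assumed setting.
    """
    x1, y1, x2, y2 = box
    pixels = image[y1:y2, x1:x2].reshape(-1, 3)
    km = KMeans(n_clusters=k, n_init=10).fit(pixels)
    counts = np.bincount(km.labels_)
    return km.cluster_centers_[counts.argmax()]      # RGB of the dominant cluster

def relative_size(area, class_areas):
    """Small / normal / large from the 20th and 80th percentiles of the class's areas."""
    lo, hi = np.percentile(class_areas, [20, 80])
    return "small" if area <= lo else "large" if area >= hi else "normal"

def relative_position(cx, cy, img_w, img_h):
    """Assign one of nine 3x3 grid cells based on the object's center pixel."""
    cols = ["left", "center", "right"]
    rows = ["top", "middle", "bottom"]
    col = cols[min(int(3 * cx / img_w), 2)]
    row = rows[min(int(3 * cy / img_h), 2)]
    return col if row == "middle" else row if col == "center" else f"{row} {col}"
```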
Expression Generation: To emulate natural language expressions, we employ predefined textual templates based on [39]. The phrase template encompasses the attributes {a1, ..., a5} from Table 3. The expression for a group of objects of the same class is formulated as:

"The/A ⟨a3⟩ ⟨a2⟩ a1 ⟨in/on the a4⟩."

Attributes that may be absent are enclosed in ⟨⟩, and attributes {a2, a3} can be arranged in any sequence. Similarly, the sentence template incorporates the relational attribute a5 to establish connections between two objects through this structure:

"The/A ⟨a3^i⟩ ⟨a2^i⟩ a1^i a5^i a1^j ⟨in/on the a4^j⟩."

Here, the indices i and j represent the i-th and j-th object.
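A small sketch of how such templates can be filled is shown below; the attribute dictionary keys (a1-a5) follow Table 3, while the handling of missing attributes and the article choice are illustrative assumptions.

```python
def phrase(obj):
    """Fill "The/A <a3> <a2> a1 <in/on the a4>" from an attribute dict; missing keys are skipped."""
    parts = ["The", obj.get("a3", ""), obj.get("a2", ""), obj["a1"]]
    if obj.get("a4"):
        parts.append(f"on the {obj['a4']}")
    return " ".join(p for p in parts if p)

def sentence(obj_i, obj_j):
    """Fill "The/A <a3_i> <a2_i> a1_i a5_i a1_j <in/on the a4_j>" for two related objects."""
    parts = ["The", obj_i.get("a3", ""), obj_i.get("a2", ""), obj_i["a1"],
             obj_i["a5"], obj_j["a1"]]
    if obj_j.get("a4"):
        parts.append(f"on the {obj_j['a4']}")
    return " ".join(p for p in parts if p)

ship = {"a1": "ship", "a2": "grey", "a3": "small", "a4": "bottom right", "a5": "anchored at"}
harbor = {"a1": "harbor", "a3": "large", "a4": "bottom right"}
print(phrase(ship))            # The small grey ship on the bottom right
print(sentence(ship, harbor))  # The small grey ship anchored at harbor on the bottom right
```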
Visual Grounding: Although referring expression datasets are available in the natural image domain [36, 37], they are lacking for the remote sensing domain. To this end, we use our short descriptions as referring expressions to create three different kinds of question answering pairs, i.e., grounding image description, referring expression, and region level captioning, as described in Table 1.

5. Experiments

5.1. Implementation Details

We initialize the weights of our model with the pretrained CLIP-ViT [24] and LLM (Vicuna-v1.5 [7]) and apply LoRA [11] finetuning. Utilizing LoRA, we refine the parameters W_q and W_v through low-rank adaptation, with a designated rank r set to 64 in our implementation. The model undergoes training consistently at an image resolution of 504×504 throughout the whole process. Each training step incorporates specifically crafted multi-modal instructional templates designed for a variety of vision-language tasks. We use the AdamW [21] optimizer with a cosine learning rate scheduler to train our model. We keep the global batch size at 144. We train our model in two stages: first, we train using all of our datasets for 1 epoch (2400 steps), followed by stage 2, where we only train on the grounding dataset for 1600 more steps.
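A hedged sketch of this fine-tuning configuration using the Hugging Face PEFT API is shown below; the rank and the W_q/W_v target modules follow the text, whereas the lora_alpha and dropout values and the exact checkpoint name are assumptions.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Rank-64 LoRA on the query/value projections (W_q, W_v), as stated in Sec. 5.1.
# lora_alpha, dropout and the checkpoint name below are assumptions.
lora_cfg = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # W_q and W_v of the attention blocks
    task_type="CAUSAL_LM",
)

llm = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")
llm = get_peft_model(llm, lora_cfg)
llm.print_trainable_parameters()   # only the low-rank adapters are updated

# Training then uses AdamW with a cosine learning-rate schedule, a global batch
# size of 144, and two stages (1 epoch / 2400 steps on all data, then 1600 steps
# on the grounding data), with the CLIP encoder and MLP adaptor kept frozen.
```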
5.2. Scene Classification

Datasets for evaluation. For scene classification, we evaluate our model using AID [33] and UCMerced [35]. AID [33] is a large-scale aerial image collection compiled from Google Earth imagery, with 30 classes, such as river, dense residential area, etc. The images are labeled by specialists in the field of remote sensing image interpretation. In total, the AID [33] dataset has 10,000 images within 30 classes. The images have been taken from different countries as well as under different weather conditions. For evaluation, we use a 20% split of the AID [33] dataset. UCMerced [35] is a land use scene classification dataset, with 2,100 images and 21 classes. Each image is of size 256×256. We use the whole UCMerced [35] dataset as a zero-shot test set.

Results. We prompt the models with all of the classes and ask them to classify the image using just one word/phrase. For example, we input a prompt like "Classify the image within one of the given classes: dense residential area, . . . , school. Answer with one word or short phrase.". We calculate zero-shot accuracy on both AID and UCMerced. GeoChat significantly outperforms other VLMs with an accuracy of 84.43% on UCMerced [35] and 72.03% on AID [33], as presented in Table 5. Notably, the recent MiniGPT-4-v2 [4] fails to follow the instructions provided for this specific task and returns unrelated classes that are not a part of the dataset. Its accuracy is close to 5% if we pass its answers to Vicuna-v1.5 [7] and ask it to check whether the output sentence refers to the ground truth class or not. In comparison, Qwen-VL and LLaVA-1.5 perform well in instruction following, but fall short of GeoChat due to a lack of domain knowledge.

Model | UCMerced | AID
Qwen-VL [1] | 62.90 | 52.60
MiniGPTv2 [4] | 4.76 | 12.90
LLaVA-1.5 [17] | 68.00 | 51.00
GeoChat | 84.43 | 72.03

Table 5. Zero-shot scene classification accuracy comparison on AID [33] and UCMerced [35] datasets. In comparison to other generic VLMs, GeoChat performs favorably well.
5.3. Visual Question Answering

Datasets for evaluation. RSVQA-HRBEN [20] comprises 10,569 high-resolution photos and 1,066,316 question-answer pairs, with 61.5%, 11.2%, 20.5%, and 6.8% divided into training, validation, test 1, and test 2 sets, respectively. This dataset has three question types: presence, comparison, and count. For evaluation, we use test set 2 of RSVQA-HRBEN [20] with 47k question-answer pairs. RSVQA-LR [20] is made up of 772 low-resolution images and 77,232 question-answer pairs, with 77.8%, 11.1%, and 11.1% used for training, validation, and testing, respectively. There are four different categories of questions: presence, comparison, rural/urban, and count. We omitted area and count questions during evaluation because the responses are numerical and quantifiable into numerous categories. In the RSVQA-LRBEN [20] dataset, for example, counting questions are quantified into five categories: 0, between 1 and 10, between 11 and 100, between 101 and 1000, and greater than 1000. For evaluation, we use the test set of RSVQA-LRBEN [20] with 7k question-answer pairs.

Results. To constrain the answers to a simple yes/no and for rural/urban question types, we add a suitable prompt at the end of each question. GeoChat performs close to the SOTA specialist model on the RSVQA-LRBEN test set, RSGPT [12], which in comparison is finetuned on the target dataset for 5 iterations. We also match the SOTA on the urban/rural classification subset, as presented in Table 6. For RSVQA-HRBEN, GeoChat outperforms other VLMs in the zero-shot setting by 3.9% on average accuracy, while beating LLaVA-v1.5 [17] by 15.9% on the Comparison subset, as shown in Table 8.

Method | Presence | Comparison | Rural/Urban | Avg. Accuracy
LLaVA-1.5 [17] | 55.46 | 68.20 | 59.00 | 62.77
Qwen-vl-Chat [1] | 38.57 | 67.59 | 61.00 | 55.35
MiniGPTv2 [4] | 55.16 | 55.22 | 39.00 | 54.96
RSVQA [20] | 87.47 | 81.50 | 90.00 | 86.32
EasyToHard [38] | 90.66 | 87.49 | 91.67 | 89.94
Bi-Modal [2] | 91.06 | 91.16 | 92.66 | 91.63
SHRNet [40] | 91.03 | 90.48 | 94.00 | 91.84
RSGPT [12] | 91.17 | 91.70 | 94.00 | 92.29
GeoChat | 91.09 | 90.33 | 94.00 | 90.70

Table 6. Comparisons with general zero-shot (top) and RS-VQA specialized (middle) models on the RSVQA-LRBEN [20] dataset for the VQA task. [1, 4, 17] are evaluated in the zero-shot setting. GeoChat outperforms other zero-shot models and performs competitively with SoTA supervised models like RSGPT, which are specifically finetuned on the target dataset (while ours is a generic model not specifically finetuned on the target dataset).

Model | Presence | Comparison | Average Accuracy
Qwen-VL [1] | 66.44 | 60.41 | 63.06
LLaVA-1.5 [17] | 69.83 | 67.29 | 68.40
MiniGPTv2 [4] | 40.79 | 50.91 | 46.46
GeoChat | 58.45 | 83.19 | 72.30

Table 8. Comparison with other general zero-shot models on the RSVQA-HRBEN [20] dataset for visual QA. All models here have not been trained on the target dataset. GeoChat performs favorably well compared to generic VLMs.

5.4. Visual Grounding

Datasets for evaluation. For the evaluation of grounding tasks, we propose a new benchmark that contains different referring and grounding tasks. We use the validation set from [30] and use the same dataset creation pipeline as in Sec. 4 to construct the test benchmark. There are a total of 7653 [refer], 758 [grounding], and 555 grounding description questions. We use acc@0.5 as the evaluation metric: a prediction is counted as correct if the predicted box has an overlap of more than 0.5 IoU with the ground-truth box.
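A minimal sketch of this metric for axis-aligned boxes is given below; since the benchmark's boxes can be rotated, a full implementation would need polygon intersection for θ ≠ 0, which is omitted here for brevity.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def acc_at_05(predictions, ground_truths):
    """Fraction of predicted boxes whose IoU with the ground truth exceeds 0.5."""
    hits = sum(iou(p, g) > 0.5 for p, g in zip(predictions, ground_truths))
    return hits / max(len(ground_truths), 1)

print(acc_at_05([(10, 10, 50, 50)], [(12, 8, 48, 52)]))  # 1.0 for this pair
```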

Results. Table 7 shows the performance of our method and MiniGPT-4-v2 [4] on the proposed benchmark. Overall, the model performance is low on small objects or when it has to predict multiple boxes. Compared to MiniGPT-4-v2 [4], our model works better on medium-sized objects. On the grounding description task, we evaluate both the IoU of the multiple bounding boxes generated and the quality of the generated text answer. Our model provides a better description with slightly better box accuracy than MiniGPT-4-v2 [4] (Table 9). As for region-level captioning, we evaluate both models based on text accuracy with respect to ground-truth region-level captions (Table 10). Our model significantly outperforms MiniGPT-4-v2 in terms of ROUGE and METEOR scores.

Model | Small | Medium | Large | Single-object grounding | Multi-object grounding | [refer] | [grounding] | Overall
MiniGPTv2 [4] | 1.7 | 9.9 | 21.9 | 9.1 | 3.6 | 8.2 | 2.6 | 7.6
GeoChat | 2.9 | 13.6 | 21.7 | 16.0 | 4.3 | 10.5 | 11.8 | 10.6

Table 7. Performance (acc@0.5, in %) comparison of GeoChat on our benchmark. Small, medium and large refer to the size of the objects based on the bounding box area. Single/multi-object refer to how many objects the question asks the model to predict. [refer]: object referenced using one attribute from a2, a3 or a4 in Table 3. [grounding]: objects referenced using a combination of attributes from a1-a5 in Table 3. Overall, GeoChat outperforms the baseline, but there is still significant room for further improvement on this complex task.

Model | acc@0.5 | acc@0.25 | METEOR
MiniGPTv2 [4] | 10.8 | 30.9 | 16.4
GeoChat | 11.7 | 33.9 | 48.9

Table 9. Results on the grounding description task.

Model | ROUGE-1 | ROUGE-L | METEOR
MiniGPTv2 [4] | 32.1 | 31.2 | 10.0
GeoChat | 87.3 | 87.2 | 83.9

Table 10. Region-level captioning performance.

6. Conclusion

Although recent advancements in large Vision-Language Models (VLMs) have shown promise in natural image domains, their performance in Remote Sensing (RS) scenarios is still limited due to unique domain-specific challenges. Addressing this gap, we present GeoChat, the first unified remote sensing VLM that excels in multitask conversational capabilities with high-resolution RS images. GeoChat not only answers image-level queries but also engages in region-specific dialogue, grounding responses with precise spatial coordinates. We create a novel RS multimodal instruction-following dataset comprising 318k image-instruction pairs with a diverse multitask format. GeoChat achieves robust zero-shot performance across various RS tasks including scene classification, VQA, multi-turn dialogue, visual grounding and referring object detection, thus establishing a comprehensive benchmark.
References

[1] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
[2] Yakoub Bazi, Mohamad Mahmoud Al Rahhal, Mohamed Lamine Mekhalfi, Mansour Abdulaziz Al Zuair, and Farid Melgani. Bi-modal transformer-based approach for visual question answering in remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing, 60:1–11, 2022.
[3] Christel Chappuis, Valérie Zermatten, Sylvain Lobry, Bertrand Le Saux, and Devis Tuia. Prompt-rsvqa: Prompting visual context to a language model for remote sensing visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1372–1381, 2022.
[4] Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023.
[5] Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE, 105(10):1865–1883, 2017.
[6] Gong Cheng, Jiabao Wang, Ke Li, Xingxing Xie, Chunbo Lang, Yanqing Yao, and Junwei Han. Anchor-free oriented proposal generator for object detection. IEEE Transactions on Geoscience and Remote Sensing, 60:1–11, 2022.
[7] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023.
[8] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
[10] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
[11] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
[12] Yuan Hu, Jianlong Yuan, Congcong Wen, Xiaonan Lu, and Xiang Li. Rsgpt: A remote sensing vision language model and benchmark. arXiv preprint arXiv:2307.15266, 2023.
[13] Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023.
[14] Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. arXiv preprint arXiv:2306.00890, 2023.
[15] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
[16] Xiang Li, Congcong Wen, Yuan Hu, and Nan Zhou. Rs-clip: Zero shot remote sensing scene classification via contrastive vision-language supervision. International Journal of Applied Earth Observation and Geoinformation, 124:103497, 2023.
[17] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
[18] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
[19] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023.
[20] Sylvain Lobry, Diego Marcos, Jesse Murray, and Devis Tuia. Rsvqa: Visual question answering for remote sensing data. IEEE Transactions on Geoscience and Remote Sensing, 58(12):8555–8566, 2020.
[21] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[22] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023.
[23] OpenAI. Gpt-4 technical report, 2023.
[24] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[25] Maryam Rahnemoonfar, Tashnim Chowdhury, Argho Sarkar, Debvrat Varshney, Masoud Yari, and Robin Murphy. Floodnet: A high resolution aerial imagery dataset for post flood scene understanding. arXiv preprint arXiv:2012.02951, 2020.
[26] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
[27] Xian Sun, Peijin Wang, Zhiyuan Yan, Feng Xu, Ruiping Wang, Wenhui Diao, Jin Chen, Jihao Li, Yingchao Feng, Tao Xu, Martin Weinmann, Stefan Hinz, Cheng Wang, and Kun Fu. Fair1m: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery. ISPRS Journal of Photogrammetry and Remote Sensing, 184:116–130, 2022.
[28] Yi Tay, Minh C Phan, Luu Anh Tuan, and Siu Cheung Hui. Learning to rank question answer pairs with holographic dual lstm architecture. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 695–704. ACM, 2017.
[29] Omkar Thawkar, Abdelrahman Shaker, Sahal Shaji Mullappilly, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Jorma Laaksonen, and Fahad Shahbaz Khan. Xraygpt: Chest radiographs summarization using large medical vision-language models. arXiv preprint arXiv:2306.07971, 2023.
[30] Di Wang, Jing Zhang, Bo Du, Dacheng Tao, and Liangpei Zhang. Scaling-up remote sensing segmentation dataset with segment anything model. arXiv preprint, 2023.
[31] Di Wang, Qiming Zhang, Yufei Xu, Jing Zhang, Bo Du, Dacheng Tao, and Liangpei Zhang. Advancing plain vision transformer toward remote sensing foundation model. IEEE Transactions on Geoscience and Remote Sensing, 61:1–15, 2023.
[32] Junjue Wang, Zhuo Zheng, Ailong Ma, Xiaoyan Lu, and Yanfei Zhong. LoveDA: A remote sensing land-cover dataset for domain adaptive semantic segmentation, 2021.
[33] Gui-Song Xia, Jingwen Hu, Fan Hu, Baoguang Shi, Xiang Bai, Yanfei Zhong, Liangpei Zhang, and Xiaoqiang Lu. Aid: A benchmark data set for performance evaluation of aerial scene classification. IEEE Transactions on Geoscience and Remote Sensing, 55(7):3965–3981, 2017.
[34] Gui-Song Xia, Xiang Bai, Jian Ding, Zhen Zhu, Serge Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, and Liangpei Zhang. Dota: A large-scale dataset for object detection in aerial images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[35] Yi Yang and Shawn Newsam. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 270–279, 2010.
[36] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
[37] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 69–85. Springer, 2016.
[38] Zhenghang Yuan, Lichao Mou, Qi Wang, and Xiao Xiang Zhu. From easy to hard: Learning language-guided curriculum for visual question answering on remote sensing data. IEEE Transactions on Geoscience and Remote Sensing, 60:1–11, 2022.
[39] Yang Zhan, Zhitong Xiong, and Yuan Yuan. Rsvg: Exploring data and models for visual grounding on remote sensing data. IEEE Transactions on Geoscience and Remote Sensing, 61:1–13, 2023.
[40] Zixiao Zhang, Licheng Jiao, Lingling Li, Xu Liu, Puhua Chen, Fang Liu, Yuxuan Li, and Zhicheng Guo. A spatial hierarchical reasoning network for remote sensing visual question answering. IEEE Transactions on Geoscience and Remote Sensing, 61:1–15, 2023.
[41] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
[42] Usman Zia, M Mohsin Riaz, and Abdul Ghafoor. Transforming remote sensing images to textual descriptions. International Journal of Applied Earth Observation and Geoinformation, 108:102741, 2022.
