We introduce SPAgent, a spatial intelligence agent designed to operate in the physical and spatial world. SPAgent invokes a diverse set of multi-modal expert tools (depth estimation, segmentation, 3D reconstruction, etc.) to perceive, understand, and reason about real-world spatial environments.
- Documentation
- SPAgent Features
- Project Structure
- External Experts
- Installation & Setup
- Quick Start
- Testing & Development
- Reinforcement Learning Training
- Important Notes
## Documentation

| Document | Description |
|---|---|
| Tool Reference | External expert tools API and deployment guide |
| Evaluation Guide | Dataset download and evaluation usage |
| Advanced Examples | Specialized agents, tool mixing, and RL training |
## SPAgent Features

SPAgent provides a modern, modular architecture with the following features:
- **Modular Tool System** - Mix and match any combination of expert tools
- **Dynamic Tool Management** - Add/remove tools at runtime
- **Parallel Tool Execution** - Automatic concurrent processing when possible
- **Multi-Image Analysis** - Handle single or multiple images seamlessly
- **Multiple Model Support** - GPT, Qwen, and local VLLM models
- **Flexible Configuration** - Easy to customize and extend
- **Reinforcement Learning** - Built-in support for GRPO training via ms-swift
## Project Structure

| Module | Path | Description |
|---|---|---|
| SPAgent Core | `spagent/core/` | Core agent architecture: `SPAgent` class and agent logic, tool base classes and registry, model base classes and wrappers, unified prompt system, data collection utilities |
| Tools | `spagent/tools/` | Modular expert tool implementations: `DepthEstimationTool`, `SegmentationTool`, `ObjectDetectionTool`, `SupervisionTool`, `YOLOETool`, `MoondreamTool`, `Pi3Tool` |
| Models | `spagent/models/` | Model wrappers for different backends: `GPTModel` (OpenAI API), `QwenModel` (DashScope API), `QwenVLLMModel` (local VLLM) |
| External Experts | `spagent/external_experts/` | Specialized expert models with a client/server architecture: depth estimation (Depth-AnythingV2), image/video segmentation (SAM2), open-vocabulary detection (GroundingDINO), vision language model (Moondream), 3D point cloud reconstruction (Pi3), YOLO-E detection & annotation (Supervision); each includes client/server implementations and can run as an external API |
| VLLM Models | `spagent/vllm_models/` | VLLM inference utilities and wrappers: GPT API wrapper, Qwen API wrapper, local VLLM inference for Qwen models |
| Examples | `examples/` | Example scripts and usage tutorials: evaluation scripts for datasets, quick start examples, tool definition examples |
| Test | `test/` | Test scripts for tools and models: Pi3 tool testing with video frame extraction, integration tests |
| Train | `train/` | Reinforcement learning training scripts: GRPO training configurations, LoRA merge and model compression utilities, system prompts for different training modes |
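The tool base classes and registry live in `spagent/core/`, so new experts can plug in alongside the built-in ones (tool definition examples ship under `examples/`). Purely as a hypothetical illustration (the class shape, the `execute` method, and the return format below are invented for this sketch, not the actual `spagent` base-class API), a minimal custom tool might look like:

```python
# Hypothetical sketch only: the real base class, method names, and return
# format are defined in spagent/core/ and may differ from what is shown here.
import cv2  # assumes opencv-python is installed


class BlurDetectionTool:
    """Toy expert that scores image sharpness via variance of the Laplacian."""

    name = "blur_detection_tool"
    description = "Estimate how blurry an image is."

    def execute(self, image_path: str) -> dict:
        gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
        if gray is None:
            return {"error": f"could not read {image_path}"}
        score = float(cv2.Laplacian(gray, cv2.CV_64F).var())
        # Lower Laplacian variance means fewer edges, i.e. a blurrier image.
        return {"blur_score": score, "is_blurry": score < 100.0}
```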
## External Experts

| Tool Name | Type | Main Function | Default Port | Notes |
|---|---|---|---|---|
| Depth-AnythingV2 | 3D | Monocular Depth Estimation | 20019 | Convert 2D images to pixel-level depth maps |
| SAM2 | 2D | Image Segmentation | 20020 | Segment Anything Model 2nd generation, interactive or automatic segmentation |
| GroundingDINO | 2D | Open-vocabulary Object Detection | 20022 | Detect arbitrary objects based on text descriptions |
| Moondream | 2D | Vision Language Model | 20024 | Small, efficient visual Q&A model; supports image description and Q&A |
| Pi3 | 3D | 3D Point Cloud Reconstruction | 20030 | Generates 3D point clouds and multi-view rendered images from input images |
| Supervision | 2D | Object Detection Annotation | - | YOLO models and visualization tools, used for result visualization and post-processing |
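Each expert runs as a standalone HTTP service on its default port. Before wiring servers into an agent, a quick reachability check can save debugging time; a minimal sketch that uses only the ports from the table above and assumes nothing about each server's HTTP routes:

```python
import socket

# Default ports from the table above; Supervision runs in-process, so no port.
EXPERT_PORTS = {
    "Depth-AnythingV2": 20019,
    "SAM2": 20020,
    "GroundingDINO": 20022,
    "Moondream": 20024,
    "Pi3": 20030,
}

for name, port in EXPERT_PORTS.items():
    # A plain TCP connect tells us whether something is listening,
    # without assuming anything about the server's API.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        status = "up" if sock.connect_ex(("localhost", port)) == 0 else "down"
    print(f"{name:18s} port {port}: {status}")
```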
## Installation & Setup

```bash
# Create Python 3.11 environment (other versions may have compatibility issues)
conda create -n spagent python=3.11
conda activate spagent

# Install dependencies
pip install -r requirements.txt
pip install "httpx[socks]"
```

Configure API keys as environment variables:

```bash
# OpenAI API
export OPENAI_API_KEY="your_api_key"
export OPENAI_BASE_URL="your_base_url"
# Qwen API (Apply at: https://bailian.console.aliyun.com)
export DASHSCOPE_API_KEY="your_api_key"
# Moondream API (Apply at: https://moondream.ai)
export MOONDREAM_API_KEY="your_api_key"
# Test API connection
python spagent/vllm_models/qwen.py
```

For a detailed guide to the external expert tools, refer to: External Experts Tool Usage Guide
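Before launching an agent, it can help to verify that the keys above are actually visible to Python. A minimal sketch (it checks presence only and does not validate the keys against the APIs):

```python
import os

# Keys from the setup above; MOONDREAM_API_KEY is only needed for the
# Moondream expert, and OPENAI_BASE_URL is optional for the default endpoint.
REQUIRED = ["OPENAI_API_KEY", "DASHSCOPE_API_KEY", "MOONDREAM_API_KEY"]

missing = [key for key in REQUIRED if not os.environ.get(key)]
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All API keys are set.")
```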
## Quick Start

### Basic Usage

```python
from spagent import SPAgent
from spagent.models import GPTModel
from spagent.tools import DepthEstimationTool, SegmentationTool
# Create model and tools
model = GPTModel(model_name="gpt-4o-mini")
tools = [
DepthEstimationTool(use_mock=True), # Depth estimation
SegmentationTool(use_mock=True) # Image segmentation
]
# Create agent
agent = SPAgent(model=model, tools=tools)
# Solve problem
result = agent.solve_problem("image.jpg", "Analyze the depth relationships and main objects in this image")
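# result is a dict: besides 'answer', it also records which tools were
# invoked ('used_tools') and any intermediate images they produced
# ('additional_images'); see the full-featured example below.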
print(result['answer'])
```

### Full-Featured Agent

```python
from spagent import SPAgent
from spagent.models import GPTModel
from spagent.tools import (
DepthEstimationTool, # Depth estimation
SegmentationTool, # Image segmentation
ObjectDetectionTool, # Object detection
SupervisionTool, # Supervision tool
YOLOETool, # YOLO-E detection
MoondreamTool, # Visual Q&A
Pi3Tool # 3D reconstruction
)
# Create full-featured agent
model = GPTModel(model_name="gpt-4o-mini")
tools = [
DepthEstimationTool(use_mock=True),
SegmentationTool(use_mock=True),
ObjectDetectionTool(use_mock=True),
SupervisionTool(use_mock=True),
YOLOETool(use_mock=True)
]
agent = SPAgent(model=model, tools=tools, max_workers=4)
# Complex problem analysis
result = agent.solve_problem(
"image.jpg",
"Comprehensively analyze this image: identify all objects, analyze depth relationships, and segment important regions"
)
print(f"Answer: {result['answer']}")
print(f"Used tools: {result['used_tools']}")
print(f"Additional images: {result['additional_images']}")# Start with a basic agent
agent = SPAgent(model=GPTModel())
# Dynamically add tools
agent.add_tool(DepthEstimationTool(use_mock=True))
agent.add_tool(SegmentationTool(use_mock=True))
# View current tools
print(f"Current tools: {agent.list_tools()}")
# Remove unnecessary tools
agent.remove_tool("depth_estimation_tool")
# Change model
from spagent.models import QwenModel
agent.set_model(QwenModel(model_name="qwen2.5-vl-7b-instruct"))
```

### Multi-Image Analysis

```python
# Analyze multiple images
image_paths = ["image1.jpg", "image2.jpg", "image3.jpg"]
result = agent.solve_problem(
image_paths,
"Compare the differences between these images, analyze depth changes and object distribution"
)
```
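The `solve_problem` API above also composes into simple ad-hoc batch runs. A sketch that loops an agent over a folder of images; the directory path and question are placeholders:

```python
from pathlib import Path

# Assumes `agent` was constructed as in the examples above.
image_dir = Path("data/images")  # placeholder directory
answers = {}
for image_path in sorted(image_dir.glob("*.jpg")):
    result = agent.solve_problem(
        str(image_path),
        "Describe the spatial layout of the main objects in this image"
    )
    answers[image_path.name] = result['answer']

for name, answer in answers.items():
    print(f"{name}: {answer}")
```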
For a detailed guide to image dataset evaluation, refer to: Image Dataset Evaluation Usage Guide

Basic Evaluation Commands:
```bash
# Normal evaluation
python examples/evaluation/evaluate_img.py --data_path path/to/json --model gpt/qwen3-vl-4b --max_samples 15 --max_iterations 3 --task "your task name"

# Evaluation without tools (clean version)
python examples/evaluation/evaluate_img_wotools.py --data_path path/to/json --model gpt/qwen3-vl-4b --max_samples 15 --max_iterations 1 --task "your task name"

# Collect data for SFT
python examples/evaluation/evaluate_img_with_data_collection.py --data_path path/to/json --model gpt/qwen3-vl-4b --max_samples 15 --max_iterations 3 --enable_data_collection

# Example: Evaluate on BLINK dataset
python examples/evaluation/evaluate_img.py --data_path dataset/Multi-view_Reasoning_BLINK_subset.jsonl --max_samples 20 --model gpt-4.1 --max_iterations 4
```
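The evaluation CLI also lends itself to scripted sweeps; a sketch that runs the same settings across several datasets (the `.jsonl` filenames are placeholders):

```bash
# Sweep several datasets with identical evaluation settings.
# The .jsonl filenames below are placeholders; substitute your own.
for DATASET in dataset/first_subset.jsonl dataset/second_subset.jsonl; do
    python examples/evaluation/evaluate_img.py \
        --data_path "$DATASET" \
        --model gpt-4.1 \
        --max_samples 20 \
        --max_iterations 4 \
        --task "$(basename "$DATASET" .jsonl)"
done
```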
For more advanced usage patterns, specialized agents, tool mixing strategies, video analysis, and reinforcement learning training, refer to: Advanced Examples

## Testing & Development

Use real deployed expert services instead of mocks:

```python
# Use real deployed services
tools = [
    DepthEstimationTool(use_mock=False, server_url="http://localhost:20019"),
    SegmentationTool(use_mock=False, server_url="http://localhost:20020"),
    ObjectDetectionTool(use_mock=False, server_url="http://localhost:30969")
]
```

Test the Pi3 tool with video frame extraction:
```python
# test/test_pi3_llm.py - Video analysis with Pi3 3D reconstruction
from spagent.core.spagent import SPAgent
from spagent.models import GPTModel
from spagent.tools import Pi3Tool
# Configure model and Pi3 tool
model = GPTModel(model_name="gpt-4o-mini", temperature=0.7)
tools = [Pi3Tool(use_mock=False, server_url="http://localhost:20030")]
agent = SPAgent(model=model, tools=tools, max_workers=4)
# Analyze video frames
result = agent.solve_problem(
frame_paths, # List of extracted frame paths
"Based on these frames from a video, please answer: Which direction did the object move?",
video_path="path/to/video.mp4", # Optional: for Pi3 to extract more frames
pi3_num_frames=50 # Number of frames for Pi3 analysis
)
```

## Reinforcement Learning Training

SPAgent supports GRPO (Group Relative Policy Optimization) reinforcement learning training using ms-swift.
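GRPO scores each prompt by sampling a group of generations and normalizing rewards within that group, which removes the need for a learned value baseline. As a rough illustration of the group-relative advantage at the core of the method (a simplified sketch, not ms-swift's implementation):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each sampled generation's reward against its group.

    GRPO uses the group mean as the baseline, so a generation is
    advantaged only if it beats its sibling generations.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Example: one prompt, num_generations=8 samples, binary accuracy reward
rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))
```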
The following training scripts are provided:

| Script | Description |
|---|---|
| `train/train_grpo.sh` | Standard GRPO training with tool calling |
| `train/train_grpo_all_angles.sh` | GRPO training with all angle combinations |
| `train/train_grpo_notool.sh` | GRPO training without tool calling (baseline) |
| `train/merge_lora.sh` | Merge LoRA adapters into the base model |
| `train/compress_model.sh` | Compress trained model checkpoints |
```bash
# Standard GRPO training
cd train
bash train_grpo.sh

# Training without tools (baseline)
bash train_grpo_notool.sh

# Training with all angle combinations
bash train_grpo_all_angles.sh
```

A representative `swift rlhf` configuration:

```bash
swift rlhf \
--rlhf_type grpo \
--model path/to/Qwen3-VL-4B-Instruct \
--external_plugins plugin/plugin.py \
--multi_turn_scheduler spagent_tool_call_scheduler \
--max_turns 3 \
--reward_funcs external_r1v_acc external_multiturn_format \
--reward_weights 1.0 1.0 \
--train_type full \
--torch_dtype bfloat16 \
--dataset path/to/training_data.jsonl \
--num_generations 8 \
--temperature 0.6 \
--deepspeed zero2 \
    --output_dir output/grpo_experiment
```

```bash
# Merge LoRA weights into base model
swift export \
--adapters output/grpo_xxx/checkpoint-xxx \
--merge_lora true
# Compress model checkpoint for deployment
bash train/compress_model.sh
```

## Important Notes

- Python Version: Python 3.11 is recommended; other versions may have compatibility issues
- Memory Requirements: Running the expert services in real (non-mock) mode requires >= 24GB of GPU memory
- Network Configuration: Ensure API keys and server addresses are configured correctly
- Concurrency Control: Control the number of parallel tool calls via the `max_workers` parameter
## Citation

If you find this work helpful, please consider citing our paper:

```bibtex
@article{zhang2026think3d,
title={Think3D: Thinking with Space for Spatial Reasoning},
author={Zhang, Zaibin and Wu, Yuhan and Jia, Lianjie and Wang, Yifan and Zhang, Zhongbo and Li, Yijiang and Ran, Binghao and Zhang, Fuxi and Sun, Zhuohan and Yin, Zhenfei and others},
journal={arXiv preprint arXiv:2601.13029},
year={2026}
}
```

If you find SPAgent useful for your research or projects, please consider giving us a ⭐ star! Your support helps us continue improving and maintaining this project.
