# 🌍 SPAgent: Agent in the Physical & Spatial World

Think3D: Thinking with Space for Spatial Reasoning

## 📌 Introduction

We introduce SPAgent, a spatial intelligence agent designed to operate in the physical and spatial world. SPAgent can invoke a diverse set of multi-modal expert tools (depth estimation, segmentation, 3D reconstruction, etc.) to perceive, understand, and reason about real-world spatial environments.


## 📚 Documentation

| Document | Description |
|----------|-------------|
| Tool Reference | External expert tools API and deployment guide |
| Evaluation Guide | Dataset download and evaluation usage |
| Advanced Examples | Specialized agents, tool mixing, and RL training |

## SPAgent Features

SPAgent provides a modern, modular architecture with the following features:

- ✅ Modular Tool System - Mix and match any combination of expert tools
- ✅ Dynamic Tool Management - Add or remove tools at runtime
- ✅ Parallel Tool Execution - Automatic concurrent processing when possible
- ✅ Multi-Image Analysis - Handle single or multiple images seamlessly
- ✅ Multiple Model Support - GPT, Qwen, and local VLLM models
- ✅ Flexible Configuration - Easy to customize and extend
- ✅ Reinforcement Learning - GRPO training support via ms-swift

## 📂 Project Structure

| Module | Path | Description |
|--------|------|-------------|
| SPAgent Core | `spagent/core/` | Core agent architecture: `SPAgent` class and agent logic; tool base classes and registry; model base classes and wrappers; unified prompt system; data collection utilities |
| Tools | `spagent/tools/` | Modular expert tool implementations: `DepthEstimationTool`, `SegmentationTool`, `ObjectDetectionTool`, `SupervisionTool`, `YOLOETool`, `MoondreamTool`, `Pi3Tool` |
| Models | `spagent/models/` | Model wrappers for different backends: `GPTModel` (OpenAI API), `QwenModel` (DashScope API), `QwenVLLMModel` (local VLLM) |
| External Experts | `spagent/external_experts/` | Specialized expert models with client/server architecture: Depth Estimation (Depth-AnythingV2); Image/Video Segmentation (SAM2); Open-vocabulary Detection (GroundingDINO); Vision Language Model (Moondream); 3D Point Cloud Reconstruction (Pi3); YOLO-E Detection & Annotation (Supervision). Each includes client/server implementations and can run as an external API |
| VLLM Models | `spagent/vllm_models/` | VLLM inference utilities and wrappers: GPT API wrapper; Qwen API wrapper; local VLLM inference for Qwen models |
| Examples | `examples/` | Example scripts and usage tutorials: evaluation scripts for datasets; quick start examples; tool definition examples |
| Test | `test/` | Test scripts for tools and models: Pi3 tool testing with video frame extraction; integration tests |
| Train | `train/` | Reinforcement learning training scripts: GRPO training configurations; LoRA merge and model compression utilities; system prompts for different training modes |

## 🔍 External Experts

| Tool Name | Type | Main Function | Default Port | Notes |
|-----------|------|---------------|--------------|-------|
| Depth-AnythingV2 | 3D | Monocular Depth Estimation | 20019 | Converts 2D images to pixel-level depth maps |
| SAM2 | 2D | Image Segmentation | 20020 | Segment Anything Model (2nd generation); interactive or automatic segmentation |
| GroundingDINO | 2D | Open-vocabulary Object Detection | 20022 | Detects arbitrary objects based on text descriptions |
| Moondream | 2D | Vision Language Model | 20024 | Small, efficient visual Q&A model; supports image description and Q&A |
| Pi3 | 3D | 3D Point Cloud Reconstruction | 20030 | Generates 3D point clouds and multi-view rendered images from a single image |
| Supervision | 2D | Object Detection Annotation | - | YOLO models and visualization tools, used for result visualization and post-processing |
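
Each expert runs as an HTTP service on its default port, so you can also query one directly, outside the agent loop. The sketch below is illustrative only: the `/predict` route and the `"image"` payload field are assumptions for illustration, not the documented API; see the Tool Reference for the real request format.

```python
# Minimal sketch of querying an expert service over HTTP.
# NOTE: the /predict route and the "image" payload field are assumptions
# for illustration; consult the Tool Reference for the actual API.
import base64
import requests

def query_expert(server_url: str, image_path: str) -> dict:
    """POST a base64-encoded image to a hypothetical /predict endpoint."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    resp = requests.post(f"{server_url}/predict", json={"image": image_b64}, timeout=60)
    resp.raise_for_status()
    return resp.json()

# Example: the Depth-AnythingV2 server on its default port
depth_result = query_expert("http://localhost:20019", "image.jpg")
```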

## 🛠️ Installation & Setup

### 1. Environment Setup

```bash
# Create a Python 3.11 environment (other versions may have compatibility issues)
conda create -n spagent python=3.11
conda activate spagent

# Install dependencies
pip install -r requirements.txt
pip install "httpx[socks]"
```

### 2. API Configuration

```bash
# OpenAI API
export OPENAI_API_KEY="your_api_key"
export OPENAI_BASE_URL="your_base_url"

# Qwen API (apply at: https://bailian.console.aliyun.com)
export DASHSCOPE_API_KEY="your_api_key"

# Moondream API (apply at: https://moondream.ai)
export MOONDREAM_API_KEY="your_api_key"

# Test API connection
python spagent/vllm_models/qwen.py
```
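
After exporting the keys, a quick sanity check that they are visible to Python (standard library only, nothing SPAgent-specific):

```python
# Verify that the API keys exported above are visible to this process.
import os

for key in ("OPENAI_API_KEY", "DASHSCOPE_API_KEY", "MOONDREAM_API_KEY"):
    print(f"{key}: {'set' if os.environ.get(key) else 'MISSING'}")
```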

### 3. Deploy External Expert Services

For a detailed guide to using the external expert tools, please refer to the External Experts Tool Usage Guide.

## 🚀 Quick Start

### 1. Basic Usage

```python
from spagent import SPAgent
from spagent.models import GPTModel
from spagent.tools import DepthEstimationTool, SegmentationTool

# Create the model and tools
model = GPTModel(model_name="gpt-4o-mini")
tools = [
    DepthEstimationTool(use_mock=True),    # Depth estimation
    SegmentationTool(use_mock=True)        # Image segmentation
]

# Create the agent
agent = SPAgent(model=model, tools=tools)

# Solve a problem
result = agent.solve_problem("image.jpg", "Analyze the depth relationships and main objects in this image")
print(result['answer'])
```

### 2. Multi-Tool Usage

```python
from spagent import SPAgent
from spagent.models import GPTModel
from spagent.tools import (
    DepthEstimationTool,      # Depth estimation
    SegmentationTool,         # Image segmentation
    ObjectDetectionTool,      # Object detection
    SupervisionTool,          # Supervision tool
    YOLOETool,                # YOLO-E detection
    MoondreamTool,            # Visual Q&A
    Pi3Tool                   # 3D reconstruction
)

# Create a full-featured agent
model = GPTModel(model_name="gpt-4o-mini")
tools = [
    DepthEstimationTool(use_mock=True),
    SegmentationTool(use_mock=True),
    ObjectDetectionTool(use_mock=True),
    SupervisionTool(use_mock=True),
    YOLOETool(use_mock=True)
]

agent = SPAgent(model=model, tools=tools, max_workers=4)

# Complex problem analysis
result = agent.solve_problem(
    "image.jpg",
    "Comprehensively analyze this image: identify all objects, analyze depth relationships, and segment important regions"
)

print(f"Answer: {result['answer']}")
print(f"Used tools: {result['used_tools']}")
print(f"Additional images: {result['additional_images']}")
```

### 3. Dynamic Tool Management

```python
from spagent import SPAgent
from spagent.models import GPTModel
from spagent.tools import DepthEstimationTool, SegmentationTool

# Start with a basic agent
agent = SPAgent(model=GPTModel())

# Dynamically add tools
agent.add_tool(DepthEstimationTool(use_mock=True))
agent.add_tool(SegmentationTool(use_mock=True))

# View the current tools
print(f"Current tools: {agent.list_tools()}")

# Remove tools that are no longer needed
agent.remove_tool("depth_estimation_tool")

# Swap in a different model
from spagent.models import QwenModel
agent.set_model(QwenModel(model_name="qwen2.5-vl-7b-instruct"))
```

### 4. Multi-Image Analysis

```python
# Analyze multiple images
image_paths = ["image1.jpg", "image2.jpg", "image3.jpg"]
result = agent.solve_problem(
    image_paths,
    "Compare the differences between these images, analyze depth changes and object distribution"
)
```

### 5. Image Dataset Evaluation

For a detailed guide to image dataset evaluation, please refer to the Image Dataset Evaluation Usage Guide.

Basic Evaluation Commands:

```bash
# Normal evaluation
python examples/evaluation/evaluate_img.py --data_path path/to/json --model gpt/qwen3-vl-4b --max_samples 15 --max_iterations 3 --task "your task name"

# Evaluation without tools (clean version)
python examples/evaluation/evaluate_img_wotools.py --data_path path/to/json --model gpt/qwen3-vl-4b --max_samples 15 --max_iterations 1 --task "your task name"

# Collect data for SFT
python examples/evaluation/evaluate_img_with_data_collection.py --data_path path/to/json --model gpt/qwen3-vl-4b --max_samples 15 --max_iterations 3 --enable_data_collection

# Example: evaluate on the BLINK dataset
python examples/evaluation/evaluate_img.py --data_path dataset/Multi-view_Reasoning_BLINK_subset.jsonl --max_samples 20 --model gpt-4.1 --max_iterations 4
```

For more advanced usage patterns (specialized agents, tool mixing strategies, video analysis, and reinforcement learning training), please refer to the Advanced Examples.

## 🧪 Testing & Development

### Real Service Mode

```python
# Use real deployed services
tools = [
    DepthEstimationTool(use_mock=False, server_url="http://localhost:20019"),
    SegmentationTool(use_mock=False, server_url="http://localhost:20020"),
    ObjectDetectionTool(use_mock=False, server_url="http://localhost:30969")
]
```
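
Before turning off mock mode, it can help to confirm the services are actually reachable. The probe below only touches the base URLs; it does not assume any particular health endpoint:

```python
# Reachability probe for the expert servers before setting use_mock=False.
import requests

for url in ["http://localhost:20019", "http://localhost:20020", "http://localhost:30969"]:
    try:
        requests.get(url, timeout=3)
        print(f"{url}: reachable")
    except requests.RequestException:
        print(f"{url}: not reachable - is the service running?")
```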

### Video Analysis Testing

Test the Pi3 tool with video frame extraction:

```python
# test/test_pi3_llm.py - Video analysis with Pi3 3D reconstruction
from spagent.core.spagent import SPAgent
from spagent.models import GPTModel
from spagent.tools import Pi3Tool

# Configure the model and the Pi3 tool
model = GPTModel(model_name="gpt-4o-mini", temperature=0.7)
tools = [Pi3Tool(use_mock=False, server_url="http://localhost:20030")]

agent = SPAgent(model=model, tools=tools, max_workers=4)

# Analyze video frames
result = agent.solve_problem(
    frame_paths,  # List of extracted frame paths
    "Based on these frames from a video, please answer: Which direction did the object move?",
    video_path="path/to/video.mp4",  # Optional: for Pi3 to extract more frames
    pi3_num_frames=50  # Number of frames for Pi3 analysis
)
```
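
The example above assumes `frame_paths` already exists. One generic way to produce it with OpenCV (a sketch, not the repo's own extraction utility; `test/test_pi3_llm.py` may do this differently):

```python
# Build `frame_paths` by sampling frames evenly from a video with OpenCV.
import os
import cv2

def extract_frames(video_path: str, out_dir: str, num_frames: int = 8) -> list:
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    paths = []
    for i in range(num_frames):
        # Seek to an evenly spaced frame index, then decode that frame
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // num_frames)
        ok, frame = cap.read()
        if not ok:
            break
        path = os.path.join(out_dir, f"frame_{i:03d}.jpg")
        cv2.imwrite(path, frame)
        paths.append(path)
    cap.release()
    return paths

frame_paths = extract_frames("path/to/video.mp4", "frames", num_frames=8)
```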

## 🎯 Reinforcement Learning Training

SPAgent supports GRPO (Group Relative Policy Optimization) reinforcement learning training using ms-swift.

### Training Scripts

| Script | Description |
|--------|-------------|
| `train/train_grpo.sh` | Standard GRPO training with tool calling |
| `train/train_grpo_all_angles.sh` | GRPO training with all angle combinations |
| `train/train_grpo_notool.sh` | GRPO training without tool calling (baseline) |
| `train/merge_lora.sh` | Merge LoRA adapters into the base model |
| `train/compress_model.sh` | Compress trained model checkpoints |

### Basic Training Commands

```bash
# Standard GRPO training
cd train
bash train_grpo.sh

# Training without tools (baseline)
bash train_grpo_notool.sh

# Training with all angle combinations
bash train_grpo_all_angles.sh
```

### Key Training Parameters

```bash
swift rlhf \
    --rlhf_type grpo \
    --model path/to/Qwen3-VL-4B-Instruct \
    --external_plugins plugin/plugin.py \
    --multi_turn_scheduler spagent_tool_call_scheduler \
    --max_turns 3 \
    --reward_funcs external_r1v_acc external_multiturn_format \
    --reward_weights 1.0 1.0 \
    --train_type full \
    --torch_dtype bfloat16 \
    --dataset path/to/training_data.jsonl \
    --num_generations 8 \
    --temperature 0.6 \
    --deepspeed zero2 \
    --output_dir output/grpo_experiment
```

### Post-Training

```bash
# Merge LoRA weights into the base model
swift export \
    --adapters output/grpo_xxx/checkpoint-xxx \
    --merge_lora true

# Compress the model checkpoint for deployment
bash train/compress_model.sh
```

## ⚠️ Important Notes

1. **Python Version**: Python 3.11 is recommended; other versions may have compatibility issues.
2. **Memory Requirements**: real (non-mock) mode requires at least 24 GB of GPU memory.
3. **Network Configuration**: ensure API keys and server addresses are configured correctly.
4. **Concurrency Control**: control the number of tools run in parallel via the `max_workers` parameter.

πŸ“ Citation

If you find this work helpful, please consider citing our paper:

@article{zhang2026think3d,
  title={Think3D: Thinking with Space for Spatial Reasoning},
  author={Zhang, Zaibin and Wu, Yuhan and Jia, Lianjie and Wang, Yifan and Zhang, Zhongbo and Li, Yijiang and Ran, Binghao and Zhang, Fuxi and Sun, Zhuohan and Yin, Zhenfei and others},
  journal={arXiv preprint arXiv:2601.13029},
  year={2026}
}

## ⭐ Star History

If you find SPAgent useful for your research or projects, please consider giving us a ⭐ star! Your support helps us continue improving and maintaining this project.

🌟 Thank you for your support! 🌟

