LEO-VL: Efficient Scene Representation for
Scalable 3D Vision-Language Learning

¹Peking University ²Beijing Institute for General Artificial Intelligence (BIGAI)
³Tsinghua University ⁴Beijing University of Posts and Telecommunications

Overview

LEO-VL overview

LEO-VL features an efficient scene representation with significantly reduced representation costs, unlocking the scalability of 3D-VL learning across diverse scene domains and tasks.

Summary

⚠️ Obstacles for 3D VLMs

  • 🧊  Representation capacity-efficiency trade-off
  • 🗂️  Fragmented data with limited task and scene diversity
  • ⚙️  Lack of effective post-training to address robustness issues

Our solutions

  • 🧊  Efficient representation: condensed feature grid
  • 🗂️  Comprehensive data scheme: 4 domains × 5 tasks
  • ⚙️  Effective post-training objective: SceneDPO

Model

LEO-VL model

LEO-VL model flow: RGB-D inputs ➜ 2D perception ➜ back-projection ➜ voxels ➜ condensed feature grid ➜ LLM
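To make the pipeline concrete, here is a minimal numpy sketch of the two geometric steps named above: back-projecting a depth map into 3D points with pinhole intrinsics, and condensing per-point features into a fixed-size ground-plane feature grid. The condensation step here (average pooling along the vertical axis into a `grid_size × grid_size` grid) is an illustrative assumption, not LEO-VL's exact condensed feature grid; all function and parameter names are hypothetical.

```python
import numpy as np

def backproject_depth(depth, K):
    """Back-project a depth map (H, W) into camera-frame 3D points
    using pinhole intrinsics K (3, 3). Returns (H*W, 3) points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.ravel()
    x = (u.ravel() - K[0, 2]) * z / K[0, 0]
    y = (v.ravel() - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)

def condense_to_grid(points, feats, grid_size=16):
    """Toy 'condensed feature grid': scatter per-point features (N, C)
    into a grid_size x grid_size ground-plane grid by average pooling
    along the vertical axis. Returns (grid_size*grid_size, C) tokens."""
    # normalize coordinates to [0, 1) per axis
    mins, maxs = points.min(axis=0), points.max(axis=0)
    norm = (points - mins) / np.maximum(maxs - mins, 1e-6)
    # bin the two horizontal axes; the vertical axis is pooled away
    ix = np.minimum((norm[:, 0] * grid_size).astype(int), grid_size - 1)
    iz = np.minimum((norm[:, 2] * grid_size).astype(int), grid_size - 1)
    cell = ix * grid_size + iz
    c = feats.shape[1]
    grid = np.zeros((grid_size * grid_size, c))
    count = np.zeros(grid_size * grid_size)
    np.add.at(grid, cell, feats)   # sum features per cell
    np.add.at(count, cell, 1)      # count points per cell
    return grid / np.maximum(count[:, None], 1)
```

The key property this illustrates: the token count fed to the LLM is fixed at `grid_size²` regardless of how many points or voxels the scene contains, which is where the representation-cost reduction comes from.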

Accuracy-efficiency Pareto optimum

LEO-VL achieves a new accuracy-efficiency Pareto optimum

Histogram statistics of representation tokens

Condensed feature grid significantly reduces token overhead

Data

LEO-VL data overview: "SV" stands for SceneVerse, "MM" for MMScan, "✓" marks data newly created in this work, and "-" marks data filtered out for quality control. The "4 domains × 5 tasks" scheme yields over 700k 3D-VL data samples.

LEO-VL data

Qualitative examples

Qualitative examples of LEO-VL performing diverse tasks across diverse scene domains

Post-Training

$$\mathcal{L}_a = -\mathbb{E}_\mathcal{D} \log \sigma\left(\beta_a \left[\log \frac{\pi_\theta(a_\checkmark|s_\checkmark,q)}{\pi_{\mathrm{ref}}(a_\checkmark|s_\checkmark,q)} - \log \frac{\pi_\theta(a_\times|s_\checkmark,q)}{\pi_{\mathrm{ref}}(a_\times|s_\checkmark,q)}\right]\right)$$
$$\mathcal{L}_s = -\mathbb{E}_\mathcal{D} \log \sigma\left(\beta_s \left[\log \frac{\pi_\theta(a_\checkmark|s_\checkmark,q)}{\pi_{\mathrm{ref}}(a_\checkmark|s_\checkmark,q)} - \log \frac{\pi_\theta(a_\checkmark|s_\times,q)}{\pi_{\mathrm{ref}}(a_\checkmark|s_\times,q)}\right]\right)$$
$$\mathcal{L}_{NLL} = -\mathbb{E}_\mathcal{D} \log \pi_\theta(a_\checkmark | s_\checkmark, q)$$
$$\mathcal{L} = w_a\mathcal{L}_a + w_s\mathcal{L}_s + \mathcal{L}_{NLL}$$

SceneDPO is a post-training objective for 3D VLMs with three core components:

  • Answer contrast: improving the preference for positive answers over negative answers
  • Scene contrast: encouraging the model to exploit scene context by discouraging positive answers when conditioned on irrelevant scene context
  • NLL loss: incorporating a negative log-likelihood (NLL) loss for optimization stability
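The three components above can be sketched directly from the equations. Below is a minimal numpy implementation for a single sample, where `lp` and `lp_ref` hold the sequence log-probabilities under the policy π_θ and the frozen reference π_ref; the dict keys and default hyperparameter values are illustrative assumptions, not values from the paper.

```python
import numpy as np

def log_sigmoid(x):
    # numerically stable log(sigmoid(x)) = -log(1 + exp(-x))
    return -np.logaddexp(0.0, -x)

def scene_dpo_loss(lp, lp_ref, beta_a=0.1, beta_s=0.1, w_a=1.0, w_s=1.0):
    """SceneDPO objective for one sample (hypothetical sketch).
    `lp` / `lp_ref` map a (answer, scene) pair to its log-prob:
      'pos_pos' = log pi(a_pos | s_pos, q)   correct answer, correct scene
      'neg_pos' = log pi(a_neg | s_pos, q)   wrong answer,  correct scene
      'pos_neg' = log pi(a_pos | s_neg, q)   correct answer, irrelevant scene
    """
    def log_ratio(key):
        # log [ pi_theta(.) / pi_ref(.) ]
        return lp[key] - lp_ref[key]

    # answer contrast L_a: prefer a_pos over a_neg given the correct scene
    l_a = -log_sigmoid(beta_a * (log_ratio('pos_pos') - log_ratio('neg_pos')))
    # scene contrast L_s: prefer a_pos given s_pos over a_pos given s_neg
    l_s = -log_sigmoid(beta_s * (log_ratio('pos_pos') - log_ratio('pos_neg')))
    # NLL term on the positive pair, for optimization stability
    l_nll = -lp['pos_pos']
    return w_a * l_a + w_s * l_s + l_nll
```

In training this would be averaged over the dataset D (the expectation in the equations), with log-probs computed by summing per-token log-softmax scores over the answer tokens.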

Post-training results

Post-training results, including in-domain (SQA3D) and out-of-domain (Beacon3D) evaluation


BibTeX

If you find our work helpful, please consider citing us:

@article{huang2025leovl,
  title={LEO-VL: Efficient Scene Representation for Scalable 3D Vision-Language Learning},
  author={Huang, Jiangyong and Ma, Xiaojian and Linghu, Xiongkun and He, Junchao and Li, Qing and Zhu, Song-Chun and Chen, Yixin and Jia, Baoxiong and Huang, Siyuan},
  journal={arXiv preprint arXiv:2506.09935},
  year={2025}
}