LEO-VL: Efficient Scene Representation for
Scalable 3D Vision-Language Learning

¹Peking University ²Beijing Institute for General Artificial Intelligence (BIGAI)
³Tsinghua University ⁴Beijing University of Posts and Telecommunications

Overview

LEO-VL overview

LEO-VL features an efficient scene representation with significantly reduced representation costs, unlocking the scalability of 3D-VL learning across diverse scene domains and tasks.

Summary

⚠️ Obstacles for 3D VLMs

  • 🧊  Representation capacity-efficiency trade-off
  • 🗂️  Fragmented data with limited task and scene diversity
  • ⚙️  Lack of effective post-training to address robustness issues

Our solutions

  • 🧊  Efficient representation: condensed feature grid
  • 🗂️  Comprehensive data scheme: 4 domains × 5 tasks
  • ⚙️  Effective post-training objective: SceneDPO

Model

LEO-VL model

LEO-VL model flow: RGB-D inputs ➜ 2D perception ➜ back-projection ➜ voxels ➜ condensed feature grid ➜ LLM
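To make the pipeline concrete, here is a minimal numpy sketch of the two geometric steps named above: back-projecting a depth map into 3D points with pinhole intrinsics, and condensing per-point features into a fixed-size ground-plane feature grid. The condensation step here (average pooling along the vertical axis into a `grid_size × grid_size` grid) is an illustrative assumption, not LEO-VL's exact condensed feature grid; all function and parameter names are hypothetical.

```python
import numpy as np

def backproject_depth(depth, K):
    """Back-project a depth map (H, W) into camera-frame 3D points
    using pinhole intrinsics K (3, 3). Returns (H*W, 3) points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.ravel()
    x = (u.ravel() - K[0, 2]) * z / K[0, 0]
    y = (v.ravel() - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)

def condense_to_grid(points, feats, grid_size=16):
    """Toy 'condensed feature grid': scatter per-point features (N, C)
    into a grid_size x grid_size ground-plane grid by average pooling
    along the vertical axis. Returns (grid_size*grid_size, C) tokens."""
    # normalize coordinates to [0, 1) per axis
    mins, maxs = points.min(axis=0), points.max(axis=0)
    norm = (points - mins) / np.maximum(maxs - mins, 1e-6)
    # bin the two horizontal axes; the vertical axis is pooled away
    ix = np.minimum((norm[:, 0] * grid_size).astype(int), grid_size - 1)
    iz = np.minimum((norm[:, 2] * grid_size).astype(int), grid_size - 1)
    cell = ix * grid_size + iz
    c = feats.shape[1]
    grid = np.zeros((grid_size * grid_size, c))
    count = np.zeros(grid_size * grid_size)
    np.add.at(grid, cell, feats)   # sum features per cell
    np.add.at(count, cell, 1)      # count points per cell
    return grid / np.maximum(count[:, None], 1)
```

The key property this illustrates: the token count fed to the LLM is fixed at `grid_size²` regardless of how many points or voxels the scene contains, which is where the representation-cost reduction comes from.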

Accuracy-efficiency Pareto optimum

LEO-VL achieves a new accuracy-efficiency Pareto optimum

Histogram statistics of representation tokens

Condensed feature grid significantly reduces token overhead

Data

LEO-VL data overview: "SV" stands for SceneVerse, "MM" for MMScan, "✓" marks data newly created in this work, and "-" marks data filtered out for quality control. The "4 domains × 5 tasks" scheme yields over 700k 3D-VL data samples.

LEO-VL data

Qualitative examples

Qualitative examples of LEO-VL performing diverse tasks across diverse scene domains

Post-Training

$$\mathcal{L}_a = -\mathbb{E}_\mathcal{D} \log \sigma\left(\beta_a \left[\log \frac{\pi_\theta(a_\checkmark|s_\checkmark,q)}{\pi_{\mathrm{ref}}(a_\checkmark|s_\checkmark,q)} - \log \frac{\pi_\theta(a_\times|s_\checkmark,q)}{\pi_{\mathrm{ref}}(a_\times|s_\checkmark,q)}\right]\right)$$
$$\mathcal{L}_s = -\mathbb{E}_\mathcal{D} \log \sigma\left(\beta_s \left[\log \frac{\pi_\theta(a_\checkmark|s_\checkmark,q)}{\pi_{\mathrm{ref}}(a_\checkmark|s_\checkmark,q)} - \log \frac{\pi_\theta(a_\checkmark|s_\times,q)}{\pi_{\mathrm{ref}}(a_\checkmark|s_\times,q)}\right]\right)$$
$$\mathcal{L}_{NLL} = -\mathbb{E}_\mathcal{D} \log \pi_\theta(a_\checkmark | s_\checkmark, q)$$
$$\mathcal{L} = w_a\mathcal{L}_a + w_s\mathcal{L}_s + \mathcal{L}_{NLL}$$

SceneDPO is a post-training objective for 3D VLMs with three core components:

  • Answer contrast: improving the preference for positive answers over negative answers
  • Scene contrast: encouraging the model to exploit scene context by discouraging positive answers when conditioned on irrelevant scene context
  • NLL loss: incorporating a negative log-likelihood (NLL) loss for optimization stability
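The three components above can be sketched directly from the equations. Below is a minimal numpy implementation for a single sample, where `lp` and `lp_ref` hold the sequence log-probabilities under the policy π_θ and the frozen reference π_ref; the dict keys and default hyperparameter values are illustrative assumptions, not values from the paper.

```python
import numpy as np

def log_sigmoid(x):
    # numerically stable log(sigmoid(x)) = -log(1 + exp(-x))
    return -np.logaddexp(0.0, -x)

def scene_dpo_loss(lp, lp_ref, beta_a=0.1, beta_s=0.1, w_a=1.0, w_s=1.0):
    """SceneDPO objective for one sample (hypothetical sketch).
    `lp` / `lp_ref` map a (answer, scene) pair to its log-prob:
      'pos_pos' = log pi(a_pos | s_pos, q)   correct answer, correct scene
      'neg_pos' = log pi(a_neg | s_pos, q)   wrong answer,  correct scene
      'pos_neg' = log pi(a_pos | s_neg, q)   correct answer, irrelevant scene
    """
    def log_ratio(key):
        # log [ pi_theta(.) / pi_ref(.) ]
        return lp[key] - lp_ref[key]

    # answer contrast L_a: prefer a_pos over a_neg given the correct scene
    l_a = -log_sigmoid(beta_a * (log_ratio('pos_pos') - log_ratio('neg_pos')))
    # scene contrast L_s: prefer a_pos given s_pos over a_pos given s_neg
    l_s = -log_sigmoid(beta_s * (log_ratio('pos_pos') - log_ratio('pos_neg')))
    # NLL term on the positive pair, for optimization stability
    l_nll = -lp['pos_pos']
    return w_a * l_a + w_s * l_s + l_nll
```

In training this would be averaged over the dataset D (the expectation in the equations), with log-probs computed by summing per-token log-softmax scores over the answer tokens.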

Post-training results

Post-training results, including in-domain (SQA3D) and out-of-domain (Beacon3D) evaluation


BibTeX

If you find our work helpful, please consider citing us:

@article{huang2025leovl,
  title={LEO-VL: Efficient Scene Representation for Scalable 3D Vision-Language Learning},
  author={Huang, Jiangyong and Ma, Xiaojian and Linghu, Xiongkun and He, Junchao and Li, Qing and Zhu, Song-Chun and Chen, Yixin and Jia, Baoxiong and Huang, Siyuan},
  journal={arXiv preprint arXiv:2506.09935},
  year={2025}
}