GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill

CVPR 2025

🎉 Oral Presentation (Top 3.3%)

Jieming Cui1,2,*, Tengyu Liu2,*, Ziyu Meng2,3, Jiale Yu4, Ran Song3, Wei Zhang3,
Yixin Zhu1,📧, Siyuan Huang2,📧
1Institute for Artificial Intelligence, Peking University, 2National Key Laboratory of General Artificial Intelligence, BIGAI, 3School of Control Science and Engineering, Shandong University, 4Department of Automation, Tsinghua University

GROVE generates open-vocabulary, physically plausible motions through a generalized reward.

Abstract

Learning open-vocabulary physical skills for simulated agents presents a significant challenge in artificial intelligence. Current reinforcement learning approaches face critical limitations: manually designed rewards lack scalability across diverse tasks, while demonstration-based methods struggle to generalize beyond their training distribution. We introduce GROVE, a generalized reward framework that enables open-vocabulary physical skill learning without manual engineering or task-specific demonstrations. Our key insight is that LLMs and VLMs provide complementary guidance: LLMs generate precise physical constraints capturing task requirements, while VLMs evaluate motion semantics and naturalness. Through an iterative design process, VLM-based feedback continuously refines the LLM-generated constraints, creating a self-improving reward system. To bridge the domain gap between simulation and natural images, we develop Pose2CLIP, a lightweight mapper that efficiently projects agent poses directly into semantic feature space without computationally expensive rendering. Extensive experiments across diverse embodiments and learning paradigms demonstrate GROVE's effectiveness, achieving 22.2% higher motion naturalness and 25.7% better task completion scores while training 8.4x faster than previous methods. These results establish a new foundation for scalable physical skill acquisition in simulated environments.
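
To make the complementary-guidance idea concrete, the following is a minimal sketch of how a semantic score and a constraint score might be blended into one per-step reward. It is not the formulation used in the paper: the weighted sum, the weights `w_semantic` and `w_constraint`, and the names `combined_reward`, `pose_to_feature`, and `constraint_fn` are all assumptions made for illustration. Here `pose_to_feature` stands in for a pose-to-feature mapper in the spirit of Pose2CLIP, and `constraint_fn` stands in for an LLM-generated physical constraint.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def combined_reward(
    pose: Vector,
    text_feature: Vector,
    pose_to_feature: Callable[[Vector], Vector],
    constraint_fn: Callable[[Vector], float],
    w_semantic: float = 0.5,    # assumed weight, not taken from the paper
    w_constraint: float = 0.5,  # assumed weight, not taken from the paper
) -> float:
    """Blend a semantic score with a constraint score (hypothetical scheme).

    pose_to_feature: stand-in for a learned mapper from pose to feature space,
        so that no rendering is needed to score a pose against the text.
    constraint_fn: stand-in for an LLM-written physical constraint that
        returns a score for the given pose.
    """
    semantic = cosine_similarity(pose_to_feature(pose), text_feature)
    constraint = constraint_fn(pose)
    return w_semantic * semantic + w_constraint * constraint


if __name__ == "__main__":
    # Toy stand-ins: an identity "mapper" and a constraint on the first coordinate.
    identity_mapper = lambda pose: pose
    first_coord_positive = lambda pose: 1.0 if pose[0] > 0.0 else 0.0
    reward = combined_reward(
        [1.0, 0.0], [1.0, 0.0], identity_mapper, first_coord_positive
    )
    print(f"reward = {reward:.3f}")
```

In a setup like this, the iterative refinement described above would correspond to replacing `constraint_fn` between training rounds based on feedback from the semantic score, though the actual update procedure is not specified on this page.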

More motions

Position body in a shape of 'C'

Walking like a model

Playing the suona

Jump rope

BibTeX

@inproceedings{cui2025grove,
  title={GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill},
  author={Cui, Jieming and Liu, Tengyu and Meng, Ziyu and Yu, Jiale and Song, Ran and Zhang, Wei and Zhu, Yixin and Huang, Siyuan},
  booktitle={Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2025}
}