
We present an open-vocabulary and zero-shot method for arbitrary referring expression segmentation (RES), targeting more general input expressions than those handled by prior works.
Specifically, our inputs encompass both object- and part-level labels, as well as implicit references to properties or qualities of object/part function, design, style, material, etc. Our model, coined RESAnything, leverages Chain-of-Thought (CoT) reasoning, where the key idea is attribute prompting: we generate detailed descriptions of object/part attributes, including shape, color, and location, for potential segment proposals through systematic prompting of a large language model (LLM), where the proposals are produced by a foundation image segmentation model. This encourages deep reasoning about object/part attributes related to function, style, design, etc., allowing the method to handle implicit queries without any part annotations for training or fine-tuning.
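As a minimal sketch of the attribute-prompting idea, assuming SAM's automatic mask generator for proposals and a generic chat-style MLLM client (the prompt wording, `overlay_mask`, and `mllm.chat` are illustrative placeholders, not the paper's exact prompts or API):

```python
import numpy as np
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

# 1. Class-agnostic segment proposals from a foundation segmentation model (SAM).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
proposals = SamAutomaticMaskGenerator(sam).generate(image)  # image: HxWx3 uint8 array

def overlay_mask(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Dim everything outside the proposal so the MLLM attends to the region."""
    out = image.copy()
    out[~mask] = out[~mask] // 3
    return out

# 2. Attribute prompting: ask the MLLM to describe each proposal's attributes.
ATTRIBUTE_PROMPT = (
    "Describe the highlighted region: its shape, color, and location in the image, "
    "and, if identifiable, its material, function, design, or style."
)

# `mllm.chat` stands in for whatever multimodal LLM interface is available.
candidate_texts = [
    mllm.chat(images=[overlay_mask(image, m["segmentation"])], prompt=ATTRIBUTE_PROMPT)
    for m in proposals
]
```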
As the first zero-shot, LLM-based RES method, RESAnything achieves superior performance among zero-shot methods on traditional RES benchmarks and significantly outperforms existing methods on challenging scenarios involving implicit queries and complex part-level relations. We also contribute a new benchmark dataset of ~3K carefully curated RES instances to assess part-level, arbitrary RES solutions.
Overview of RESAnything, a two-stage framework for zero-shot arbitrary RES. The attribute prompting stage generates reference and candidate texts from the input image and referring expression using SAM-generated proposals and an MLLM. The mask proposal selection stage leverages the MLLM and CLIP to evaluate both candidates and proposals and produce the final response.
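For concreteness, a minimal sketch of the CLIP-based scoring in the selection stage, assuming the openai/CLIP package; the full method also consults the MLLM, which is omitted here, and `crops`, `proposals`, and `reference_text` stand in for stage-one outputs:

```python
import clip  # https://github.com/openai/CLIP
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

@torch.no_grad()
def clip_score(crop: Image.Image, text: str) -> float:
    """Cosine similarity between a proposal crop and a text description."""
    img = model.encode_image(preprocess(crop).unsqueeze(0).to(device))
    txt = model.encode_text(clip.tokenize([text], truncate=True).to(device))
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item()

# crops[i]: the image cropped to proposal i; reference_text: the stage-one MLLM
# description of the referred object/part (both produced upstream).
best = max(range(len(crops)), key=lambda i: clip_score(crops[i], reference_text))
final_mask = proposals[best]["segmentation"]
```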
| Method | refCOCO val | refCOCO testA | refCOCO testB | refCOCO+ val | refCOCO+ testA | refCOCO+ testB | refCOCOg val(U) | refCOCOg val(G) | refCOCOg test(U) |
|---|---|---|---|---|---|---|---|---|---|
| *fully-supervised on the training set* | | | | | | | | | |
| VLT | 67.5 | 70.5 | 65.2 | 56.3 | 61.0 | 50.1 | 55.0 | - | 57.7 |
| CRIS | 70.5 | 73.2 | 66.1 | 62.3 | 68.1 | 53.7 | 59.9 | - | 60.4 |
| LAVT | 72.7 | 75.8 | 68.8 | 62.1 | 68.4 | 55.1 | 61.2 | - | 62.1 |
| GRES | 73.8 | 76.5 | 70.2 | 66.0 | 71.0 | 57.7 | 65.0 | - | 66.0 |
| *pre-trained on the same task* | | | | | | | | | |
| UniRES | 71.2 | 74.8 | 66.0 | 59.9 | 66.7 | 51.4 | 62.3 | - | 63.2 |
| LISA-7B | 74.9 | 79.1 | 72.3 | 65.1 | 70.8 | 58.1 | 67.9 | - | 70.6 |
| GSVA | 77.2 | 78.9 | 73.5 | 65.9 | 69.6 | 59.8 | 72.7 | - | 73.3 |
| GLaMM | 79.5 | 83.2 | 76.9 | 72.6 | 78.7 | 64.6 | 74.2 | - | 74.9 |
| SAM4MLLM | 79.8 | 82.7 | 74.7 | 74.6 | 80.0 | 67.2 | 75.5 | - | 76.4 |
| *training-free zero-shot* | | | | | | | | | |
| GLCLIP | 26.2 | 24.9 | 26.6 | 27.8 | 25.6 | 27.8 | 33.5 | 33.6 | 33.7 |
| CaR | 33.6 | 35.4 | 30.5 | 34.2 | 36.0 | 31.0 | 36.7 | 36.6 | 36.6 |
| Ours | 68.5 | 72.2 | 70.3 | 60.7 | 65.6 | 52.2 | 60.1 | 60.5 | 60.9 |
| Method | gIoU | cIoU |
|---|---|---|
| *pre-trained / supervised* | | |
| GLaMM | 47.4 | 47.2 |
| LISA-7B-LLaVA1.5 | 53.6 | 52.3 |
| LISA-13B-LLaVA1.5 | 57.7 | 60.3 |
| SAM4MLLM | 58.4 | 60.4 |
| *training-free zero-shot* | | |
| CaR | 35.2 | 26.4 |
| Ours | 74.6 | 72.5 |
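Here gIoU and cIoU follow the standard reasoning-segmentation definitions: gIoU is the mean of per-image IoUs, while cIoU divides the accumulated intersection by the accumulated union over all samples. A minimal NumPy sketch (function names are ours):

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks of shape (H, W)."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union > 0 else 1.0  # empty-vs-empty counts as a match

def giou_ciou(preds, gts):
    """gIoU: mean of per-sample IoUs. cIoU: cumulative intersection / cumulative union."""
    per_sample = [iou(p, g) for p, g in zip(preds, gts)]
    total_inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    total_union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    return float(np.mean(per_sample)), float(total_inter / total_union)
```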
To further evaluate the capability of RESAnything in handling implicit expressions (e.g., part-level materials, features, and functionalities), we establish the ABO-image-ares benchmark for complex reasoning segmentation tasks.
| Method | gIoU | cIoU |
|---|---|---|
| LISA-13B-LLaVA1.5 | 43.3 | 34.0 |
| GLaMM | 46.2 | 38.7 |
| CaR | 24.4 | 15.7 |
| Ours | 78.2 | 72.4 |
@article{wang2025resanything,
author = {Wang, Ruiqi and Zhang, Hao},
title = {RESAnything: Attribute Prompting for Arbitrary Referring Segmentation},
journal = {arXiv preprint arXiv:2505.02867},
year = {2025},
}