RESAnything: Attribute Prompting for Arbitrary Referring Segmentation

Simon Fraser University

Open-vocabulary and zero-shot referring expression segmentation with RESanything.

Abstract

We present an open-vocabulary and zero-shot method for arbitrary referring expression segmentation (RES), targeting more general input expressions than those handled by prior works.

Specifically, our inputs encompass both object- and part-level labels as well as implicit references to properties or qualities of object/part function, design, style, material, etc. Our model, coined RESanything, leverages Chain-of-Thought (CoT) reasoning, where the key idea is attribute prompting. We generate detailed descriptions of object/part attributes, including shape, color, and location, for potential segment proposals through systematic prompting of a large language model (LLM), where the proposals are produced by a foundation image segmentation model. Our approach encourages deep reasoning about object/part attributes related to function, style, design, etc., so as to handle implicit queries without any part annotations for training or fine-tuning.
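To make attribute prompting concrete, the sketch below shows one possible form such a prompt could take. The wording, the attribute list, and the build_attribute_prompt helper are illustrative assumptions, not the prompts actually used by RESanything.

# Hypothetical attribute prompt for one SAM-generated proposal (illustrative
# wording only; the actual prompts used by RESanything may differ).
ATTRIBUTE_FIELDS = ["object/part name", "shape", "color", "material",
                    "location in the image", "function or design/style it suggests"]

def build_attribute_prompt(referring_expression: str) -> str:
    """Assemble a systematic prompt asking an MLLM to describe a highlighted
    region along a fixed list of attributes, conditioned on the query."""
    fields = "\n".join(f"- {field}" for field in ATTRIBUTE_FIELDS)
    return (
        "You are shown an image with one region highlighted.\n"
        f"Describe the highlighted region using these attributes:\n{fields}\n"
        f"Emphasize properties relevant to the query: '{referring_expression}'."
    )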

As the first zero-shot, LLM-based RES method, RESanything achieves superior performance among zero-shot methods on traditional RES benchmarks and significantly outperforms existing methods in challenging scenarios involving implicit queries and complex part-level relations. We contribute a new benchmark dataset of ~3K carefully curated RES instances to assess part-level, arbitrary RES solutions.

Video



Overview of RESanything, a two-stage framework for zero-shot arbitrary RES. The attribute prompting stage generates reference and candidate texts from the input image and referring expression, using SAM-generated proposals and an MLLM. The mask proposal selection stage leverages the MLLM and CLIP to evaluate both candidates and proposals and produce the final response.
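The following sketch restates the two stages as code, assuming generic sam, mllm, and clip wrappers; these interfaces and the additive scoring rule are placeholders for illustration, not the released implementation.

import numpy as np

def resanything_pipeline(image, expression, sam, mllm, clip):
    """Two-stage sketch: (1) attribute prompting, (2) mask proposal selection.
    sam, mllm, and clip are assumed wrappers around a segmentation foundation
    model, a multimodal LLM, and CLIP, respectively."""
    # Stage 1: attribute prompting.
    proposals = sam.generate_masks(image)  # list of binary masks (np.ndarray)
    reference_text = mllm.describe(
        image,
        prompt=f"Describe the object or part referred to by '{expression}' "
               "in terms of shape, color, material, location, and function.")
    candidate_texts = [
        mllm.describe(image, region=mask,
                      prompt="Describe this highlighted region's shape, color, "
                             "material, location, and function.")
        for mask in proposals
    ]
    # Stage 2: mask proposal selection.
    scores = []
    for mask, candidate in zip(proposals, candidate_texts):
        text_score = clip.text_similarity(reference_text, candidate)
        region_score = clip.region_text_similarity(image, mask, reference_text)
        scores.append(text_score + region_score)  # placeholder fusion rule
    return proposals[int(np.argmax(scores))]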

Vanilla Referring Expression Segmentation

Table 1: Quantitative results on standard RES benchmarks refCOCO/+/g, reported as cIoU values.
Method    | refCOCO           | refCOCO+          | refCOCOg
          | val   testA testB | val   testA testB | val(U) val(G) test(U)
fully-supervised on the training set
VLT       | 67.5  70.5  65.2  | 56.3  61.0  50.1  | 55.0   -      57.7
CRIS      | 70.5  73.2  66.1  | 62.3  68.1  53.7  | 59.9   -      60.4
LAVT      | 72.7  75.8  68.8  | 62.1  68.4  55.1  | 61.2   -      62.1
GRES      | 73.8  76.5  70.2  | 66.0  71.0  57.7  | 65.0   -      66.0
pre-trained on the same task
UniRES    | 71.2  74.8  66.0  | 59.9  66.7  51.4  | 62.3   -      63.2
LISA-7B   | 74.9  79.1  72.3  | 65.1  70.8  58.1  | 67.9   -      70.6
GSVA      | 77.2  78.9  73.5  | 65.9  69.6  59.8  | 72.7   -      73.3
GLaMM     | 79.5  83.2  76.9  | 72.6  78.7  64.6  | 74.2   -      74.9
SAM4MLLM  | 79.8  82.7  74.7  | 74.6  80.0  67.2  | 75.5   -      76.4
training-free zero-shot
GLCLIP    | 26.2  24.9  26.6  | 27.8  25.6  27.8  | 33.5   33.6   33.7
CaR       | 33.6  35.4  30.5  | 34.2  36.0  31.0  | 36.7   36.6   36.6
Ours      | 68.5  72.2  70.3  | 60.7  65.6  52.2  | 60.1   60.5   60.9

Reasoning Segmentation

Table 2: Quantitative results on ReasonSeg validation set.
Method             | gIoU | cIoU
pre-trained / supervised
GLaMM              | 47.4 | 47.2
LISA-7B-LLaVA1.5   | 53.6 | 52.3
LISA-13B-LLaVA1.5  | 57.7 | 60.3
SAM4MLLM           | 58.4 | 60.4
training-free zero-shot
CaR                | 35.2 | 26.4
Ours               | 74.6 | 72.5
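For reference, the two metrics reported above are commonly computed as follows: gIoU averages per-image IoUs, while cIoU divides the intersection accumulated over the whole split by the accumulated union. The sketch below follows these standard definitions; each benchmark's official evaluation script remains authoritative.

import numpy as np

def giou_ciou(pred_masks, gt_masks):
    """Compute gIoU (mean of per-image IoUs) and cIoU (cumulative
    intersection over cumulative union) for lists of binary masks."""
    per_image_ious, total_inter, total_union = [], 0, 0
    for pred, gt in zip(pred_masks, gt_masks):
        pred, gt = pred.astype(bool), gt.astype(bool)
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        # Convention: if both masks are empty, count the IoU as 1.0.
        per_image_ious.append(inter / union if union > 0 else 1.0)
        total_inter += inter
        total_union += union
    giou = float(np.mean(per_image_ious)) if per_image_ious else 0.0
    ciou = total_inter / total_union if total_union > 0 else 0.0
    return giou, ciou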

ABO-image-ares

To further evaluate the capability of RESanything to handle implicit expressions (e.g., part-level materials, features, and functionalities), we establish the ABO-image-ares benchmark for complex reasoning segmentation tasks.
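As a purely hypothetical illustration of the kind of query the benchmark targets (the field names and values below are not taken from the actual ABO-image-ares release), a benchmark entry pairs an image with an implicit, part-level expression and a ground-truth mask:

# Hypothetical example record (illustrative only; not an actual entry or the
# dataset's real schema).
example_query = {
    "image": "abo_chair_0001.jpg",  # hypothetical file name
    "expression": "the part of the chair you would grip to pull it away from the table",
    "ground_truth": "binary mask over the top rail of the chair back",
}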


Table 3: Quantitative results on ABO-image-ares.
Method             | gIoU | cIoU
LISA-13B-LLaVA1.5  | 43.3 | 34.0
GLaMM              | 46.2 | 38.7
CaR                | 24.4 | 15.7
Ours               | 78.2 | 72.4

Acknowledgement

We thank Yiming Qian, Kai Wang and Fenggen Yu for their invaluable contributions to this project.

BibTeX

@article{wang2025resanything,
  author    = {Wang, Ruiqi and Zhang, Hao},
  title     = {RESAnything: Attribute Prompting for Arbitrary Referring Segmentation},
  journal   = {arXiv preprint arXiv:2505.02867},
  year      = {2025},
}