
We present an open-vocabulary and zero-shot method for arbitrary referring expression segmentation (RES), targeting more general input expressions than those handled by prior works.
Specifically, our inputs encompass both object- and part-level labels, as well as implicit references to properties or qualities of object/part function, design, style, material, etc. Our model, coined RESAnything, leverages Chain-of-Thought (CoT) reasoning, where the key idea is attribute prompting: we generate detailed descriptions of object/part attributes, including shape, color, and location, for potential segment proposals through systematic prompting of a large language model (LLM), where the proposals are produced by a foundation image segmentation model. This encourages deep reasoning about object/part attributes related to function, style, design, etc., allowing the method to handle implicit queries without any part annotations for training or fine-tuning.
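As a minimal sketch of the attribute-prompting idea, assuming SAM's automatic mask generator for proposals and a generic chat-style MLLM client (the prompt wording, `overlay_mask`, and `mllm.chat` are illustrative placeholders, not the paper's exact prompts or API):

```python
import numpy as np
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

# 1. Class-agnostic segment proposals from a foundation segmentation model (SAM).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
proposals = SamAutomaticMaskGenerator(sam).generate(image)  # image: HxWx3 uint8 array

def overlay_mask(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Dim everything outside the proposal so the MLLM attends to the region."""
    out = image.copy()
    out[~mask] = out[~mask] // 3
    return out

# 2. Attribute prompting: ask the MLLM to describe each proposal's attributes.
ATTRIBUTE_PROMPT = (
    "Describe the highlighted region: its shape, color, and location in the image, "
    "and, if identifiable, its material, function, design, or style."
)

# `mllm.chat` stands in for whatever multimodal LLM interface is available.
candidate_texts = [
    mllm.chat(images=[overlay_mask(image, m["segmentation"])], prompt=ATTRIBUTE_PROMPT)
    for m in proposals
]
```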
As the first zero-shot, LLM-based RES method, RESAnything achieves superior performance among zero-shot methods on traditional RES benchmarks and significantly outperforms existing methods on challenging scenarios involving implicit queries and complex part-level relations. We also contribute a new benchmark dataset of ~3K carefully curated RES instances to assess part-level, arbitrary RES solutions.
Overview of RESAnything, a two-stage framework for zero-shot arbitrary RES. The attribute prompting stage generates reference and candidate texts from the input image and referring expression using SAM-generated proposals and an MLLM. The mask proposal selection stage leverages the MLLM and CLIP to evaluate both candidates and proposals and produce the final response.
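For concreteness, a minimal sketch of the CLIP-based scoring in the selection stage, assuming the openai/CLIP package; the full method also consults the MLLM, which is omitted here, and `crops`, `proposals`, and `reference_text` stand in for stage-one outputs:

```python
import clip  # https://github.com/openai/CLIP
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

@torch.no_grad()
def clip_score(crop: Image.Image, text: str) -> float:
    """Cosine similarity between a proposal crop and a text description."""
    img = model.encode_image(preprocess(crop).unsqueeze(0).to(device))
    txt = model.encode_text(clip.tokenize([text], truncate=True).to(device))
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item()

# crops[i]: the image cropped to proposal i; reference_text: the stage-one MLLM
# description of the referred object/part (both produced upstream).
best = max(range(len(crops)), key=lambda i: clip_score(crops[i], reference_text))
final_mask = proposals[best]["segmentation"]
```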
| Method | refCOCO val | refCOCO testA | refCOCO testB | refCOCO+ val | refCOCO+ testA | refCOCO+ testB | refCOCOg val(U) | refCOCOg val(G) | refCOCOg test(U) |
|---|---|---|---|---|---|---|---|---|---|
| *fully-supervised on the training set* | | | | | | | | | |
| VLT | 67.5 | 70.5 | 65.2 | 56.3 | 61.0 | 50.1 | 55.0 | - | 57.7 |
| CRIS | 70.5 | 73.2 | 66.1 | 62.3 | 68.1 | 53.7 | 59.9 | - | 60.4 |
| LAVT | 72.7 | 75.8 | 68.8 | 62.1 | 68.4 | 55.1 | 61.2 | - | 62.1 |
| GRES | 73.8 | 76.5 | 70.2 | 66.0 | 71.0 | 57.7 | 65.0 | - | 66.0 |
| *pre-trained on the same task* | | | | | | | | | |
| UniRES | 71.2 | 74.8 | 66.0 | 59.9 | 66.7 | 51.4 | 62.3 | - | 63.2 |
| LISA-7B | 74.9 | 79.1 | 72.3 | 65.1 | 70.8 | 58.1 | 67.9 | - | 70.6 |
| GSVA | 77.2 | 78.9 | 73.5 | 65.9 | 69.6 | 59.8 | 72.7 | - | 73.3 |
| GLaMM | 79.5 | 83.2 | 76.9 | 72.6 | 78.7 | 64.6 | 74.2 | - | 74.9 |
| SAM4MLLM | 79.8 | 82.7 | 74.7 | 74.6 | 80.0 | 67.2 | 75.5 | - | 76.4 |
| *training-free zero-shot* | | | | | | | | | |
| GLCLIP | 26.2 | 24.9 | 26.6 | 27.8 | 25.6 | 27.8 | 33.5 | 33.6 | 33.7 |
| CaR | 33.6 | 35.4 | 30.5 | 34.2 | 36.0 | 31.0 | 36.7 | 36.6 | 36.6 |
| Ours | 68.5 | 72.2 | 70.3 | 60.7 | 65.6 | 52.2 | 60.1 | 60.5 | 60.9 |
| Method | gIoU | cIoU |
|---|---|---|
| *pre-trained / supervised* | | |
| GLaMM | 47.4 | 47.2 |
| LISA-7B-LLaVA1.5 | 53.6 | 52.3 |
| LISA-13B-LLaVA1.5 | 57.7 | 60.3 |
| SAM4MLLM | 58.4 | 60.4 |
| *training-free zero-shot* | | |
| CaR | 35.2 | 26.4 |
| Ours | 74.6 | 72.5 |
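Here gIoU and cIoU follow the standard reasoning-segmentation definitions: gIoU is the mean of per-image IoUs, while cIoU divides the accumulated intersection by the accumulated union over all samples. A minimal NumPy sketch (function names are ours):

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks of shape (H, W)."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union > 0 else 1.0  # empty-vs-empty counts as a match

def giou_ciou(preds, gts):
    """gIoU: mean of per-sample IoUs. cIoU: cumulative intersection / cumulative union."""
    per_sample = [iou(p, g) for p, g in zip(preds, gts)]
    total_inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    total_union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    return float(np.mean(per_sample)), float(total_inter / total_union)
```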
To further evaluate the capability of RESAnything in handling implicit expressions (e.g., part-level materials, features, and functionalities), we establish the ABO-image-ares benchmark for complex reasoning segmentation tasks.
| Method | gIoU | cIoU |
|---|---|---|
| LISA-13B-LLaVA1.5 | 43.3 | 34.0 |
| GLaMM | 46.2 | 38.7 |
| CaR | 24.4 | 15.7 |
| Ours | 78.2 | 72.4 |
@article{wang2025resanything,
author = {Wang, Ruiqi and Zhang, Hao},
title = {RESAnything: Attribute Prompting for Arbitrary Referring Segmentation},
journal = {arXiv preprint arXiv:2505.02867},
year = {2025},
}