MESS Benchmark

Leaderboard

Our evaluation currently focuses on zero-shot transfer models, as they are designed to generalize to unseen datasets. Links to the models are provided at the end of the page. The results include mIoU values averaged over the dataset domains as well as for each dataset separately. The models are grouped by size and sorted by release date. The best-performing models within each group are highlighted in bold, the second-best are underlined. We provide random and best supervised results as lower and upper bounds.
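
The following sketch illustrates how the domain averages relate to the per-dataset results; it is illustrative only and not part of the official MESS toolkit. It assumes the domain grouping given by the column order of the dataset tables below, with the Mean column taken as the average over all 22 datasets.

```python
# Illustrative sketch: aggregate per-dataset mIoU values (in %) into the domain
# averages shown in the leaderboard. Assumption: the domain grouping follows the
# column order of the dataset tables, and "Mean" averages over all 22 datasets.
from statistics import mean

DOMAINS = {
    "General": ["BDD100K", "Dark Zurich", "MHP v1", "FoodSeg103", "ATLANTIS", "DRAM"],
    "Earth Monitoring": ["iSAID", "ISPRS Potsdam", "WorldFloods", "FloodNet", "UAVid"],
    "Medical Sciences": ["Kvasir-Instrument", "CHASE DB1", "CryoNuSeg", "PAXRay-4"],
    "Engineering": ["Corrosion CS", "DeepCrack", "PST900", "ZeroWaste-f"],
    "Agriculture and Biology": ["SUIM", "CUB-200", "CWFID"],
}

def leaderboard_row(miou_per_dataset: dict) -> dict:
    """Average per-dataset mIoU values per domain and over all 22 datasets."""
    row = {domain: mean(miou_per_dataset[name] for name in names)
           for domain, names in DOMAINS.items()}
    row["Mean"] = mean(miou_per_dataset.values())
    return row
```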

Zero-shot semantic segmentation

Domain results

| Model | General | Earth Monitoring | Medical Sciences | Engineering | Agriculture and Biology | Mean |
|---|---|---|---|---|---|---|
| Random¹ | 1.17 | 7.11 | 29.51 | 11.71 | 6.14 | 10.27 |
| Best supervised² | 49.15 | 79.12 | 89.49 | 67.66 | 81.94 | 71.13 |
| | | | | | | |
| ZSSeg-B | 19.98 | 17.98 | 41.82 | 14.0 | 22.32 | 22.73 |
| ZegFormer-B | 13.57 | 17.25 | 17.47 | 17.92 | 25.78 | 17.57 |
| X-Decoder-T | 22.01 | 18.92 | 23.28 | 15.31 | 18.17 | 19.8 |
| SAN-B | 29.35 | 30.64 | 29.85 | 23.58 | 15.07 | 26.74 |
| OpenSeeD-T | 22.49 | 25.11 | 44.44 | 16.5 | 10.35 | 24.33 |
| CAT-Seg-B | 34.96 | 34.57 | 41.65 | 26.26 | 29.32 | 33.74 |
| Grounded-SAM-B | 29.51 | 25.97 | 37.38 | 29.51 | 17.66 | 28.52 |
| | | | | | | |
| OVSeg-L | 29.54 | 29.04 | 31.9 | 14.16 | 28.64 | 26.94 |
| SAN-L | 36.18 | 38.83 | 30.27 | 16.95 | 20.41 | 30.06 |
| CAT-Seg-L | 39.93 | 39.85 | 48.49 | 26.04 | 34.06 | 38.14 |
| Grounded-SAM-L | 30.32 | 26.44 | 38.69 | 29.25 | 17.73 | 29.05 |
| CAT-Seg-H | 37.98 | 37.74 | 34.65 | 29.04 | 37.76 | 35.66 |
| Grounded-SAM-H | 30.27 | 26.44 | 38.45 | 28.16 | 17.67 | 28.78 |

Dataset results

| Model | BDD100K | Dark Zurich | MHP v1 | FoodSeg103 | ATLANTIS | DRAM | iSAID | ISPRS Potsdam | WorldFloods | FloodNet | UAVid | Kvasir-Instrument | CHASE DB1 | CryoNuSeg | PAXRay-4 | Corrosion CS | DeepCrack | PST900 | ZeroWaste-f | SUIM | CUB-200 | CWFID | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random¹ | 1.48 | 1.31 | 1.27 | 0.23 | 0.56 | 2.16 | 0.56 | 8.02 | 18.43 | 3.39 | 5.18 | 27.99 | 27.25 | 31.25 | 31.53 | 9.3 | 26.52 | 4.52 | 6.49 | 5.3 | 0.06 | 13.08 | 10.27 |
| Best supervised² | 44.8 | 63.9 | 50.0 | 45.1 | 42.22 | 45.71 | 65.3 | 87.56 | 92.71 | 82.22 | 67.8 | 93.7 | 97.05 | 73.45 | 93.77 | 49.92 | 85.9 | 82.3 | 52.5 | 74.0 | 84.6 | 87.23 | 70.99 |
| | | | | | | | | | | | | | | | | | | | | | | | |
| ZSSeg-B | 32.36 | 16.86 | 7.08 | 8.17 | 22.19 | 33.19 | 3.8 | 11.57 | 23.25 | 20.98 | 30.27 | 46.93 | 37.0 | 38.7 | 44.66 | 3.06 | 25.39 | 18.76 | 8.78 | 30.16 | 4.35 | 32.46 | 22.73 |
| ZegFormer-B | 14.14 | 4.52 | 4.33 | 10.01 | 18.98 | 29.45 | 2.68 | 14.04 | 25.93 | 22.74 | 20.84 | 27.39 | 12.47 | 11.94 | 18.09 | 4.78 | 29.77 | 19.63 | 17.52 | 28.28 | 16.8 | 32.26 | 17.57 |
| X-Decoder-T | 47.29 | 24.16 | 3.54 | 2.61 | 27.51 | 26.95 | 2.43 | 31.47 | 26.23 | 8.83 | 25.65 | 55.77 | 10.16 | 11.94 | 15.23 | 1.72 | 24.65 | 19.44 | 15.44 | 24.75 | 0.51 | 29.25 | 19.8 |
| SAN-B | 37.4 | 24.35 | 8.87 | 19.27 | 36.51 | 49.68 | 4.77 | 37.56 | 31.75 | 37.44 | 41.65 | 69.88 | 17.85 | 11.95 | 19.73 | 3.13 | 50.27 | 19.67 | 21.27 | 22.64 | 16.91 | 5.67 | 26.74 |
| OpenSeeD-T | 47.95 | 28.13 | 2.06 | 9.0 | 18.55 | 29.23 | 1.45 | 31.07 | 30.11 | 23.14 | 39.78 | 59.69 | 46.68 | 33.76 | 37.64 | 13.38 | 47.84 | 2.5 | 2.28 | 19.45 | 0.13 | 11.47 | 24.33 |
| CAT-Seg-B | 44.58 | 27.36 | 20.79 | 21.54 | 33.08 | 62.42 | 15.75 | 41.89 | 39.47 | 35.12 | 40.62 | 70.68 | 25.38 | 25.63 | 44.94 | 13.76 | 49.14 | 21.32 | 20.83 | 39.1 | 3.4 | 45.47 | 33.74 |
| Grounded-SAM-B | 41.58 | 20.91 | 29.38 | 10.48 | 17.33 | 57.38 | 12.22 | 26.68 | 33.41 | 19.19 | 38.34 | 46.82 | 23.56 | 38.06 | 41.07 | 20.88 | 59.02 | 21.39 | 16.74 | 14.13 | 0.43 | 38.41 | 28.52 |
| | | | | | | | | | | | | | | | | | | | | | | | |
| OVSeg-L | 45.28 | 22.53 | 6.24 | 16.43 | 33.44 | 53.33 | 8.28 | 31.03 | 31.48 | 35.59 | 38.8 | 71.13 | 20.95 | 13.45 | 22.06 | 6.82 | 16.22 | 21.89 | 11.71 | 38.17 | 14.0 | 33.76 | 26.94 |
| SAN-L | 43.81 | 30.39 | 9.34 | 24.46 | 40.66 | 68.44 | 11.77 | 51.45 | 48.24 | 39.26 | 43.41 | 72.18 | 7.64 | 11.94 | 29.33 | 6.83 | 23.65 | 19.01 | 18.32 | 40.01 | 19.3 | 1.91 | 30.06 |
| CAT-Seg-L | 45.83 | 33.1 | 30.03 | 30.47 | 33.6 | 66.54 | 16.09 | 51.42 | 49.86 | 39.84 | 42.02 | 79.4 | 24.99 | 35.06 | 54.5 | 16.87 | 31.42 | 25.26 | 30.62 | 53.94 | 9.24 | 39.0 | 38.14 |
| Grounded-SAM-L | 42.69 | 21.92 | 28.11 | 10.76 | 17.63 | 60.8 | 12.38 | 27.76 | 33.4 | 19.28 | 39.37 | 47.32 | 25.16 | 38.06 | 44.22 | 20.88 | 58.21 | 21.23 | 16.67 | 14.3 | 0.43 | 38.47 | 29.05 |
| CAT-Seg-H | 48.34 | 29.72 | 23.53 | 29.06 | 40.43 | 56.78 | 9.04 | 49.37 | 47.92 | 40.98 | 41.36 | 70.7 | 13.37 | 12.82 | 41.72 | 12.17 | 57.69 | 19.61 | 26.71 | 47.8 | 19.49 | 45.99 | 35.66 |
| Grounded-SAM-H | 42.95 | 22.09 | 28.05 | 9.97 | 17.68 | 60.86 | 12.44 | 27.79 | 33.23 | 19.31 | 39.41 | 46.97 | 25.13 | 38.06 | 43.64 | 20.88 | 53.74 | 21.34 | 16.68 | 14.3 | 0.43 | 38.29 | 28.78 |

Visual oracle prompts

In addition to language-guided models, we evaluated SAM using visual oracle prompts in a point-to-mask or box-to-mask setting. An oracle point or box is provided for each connected segment. See our paper for details.
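
The sketch below illustrates this prompting setup with the public segment-anything API. It is a simplified illustration under our assumptions (the oracle point is taken as the interior pixel farthest from the segment boundary, the oracle box as the segment's bounding box) and not the exact MESS evaluation code.

```python
# Simplified sketch of the oracle-prompt setting (not the exact MESS evaluation code).
# For each connected segment of a ground-truth class mask, SAM is queried with either
# one oracle point (here: the interior pixel farthest from the segment boundary) or
# one oracle box (the segment's bounding box), and the predicted masks are merged.
import numpy as np
from scipy import ndimage
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

def predict_with_oracle_prompts(image: np.ndarray, gt_mask: np.ndarray, use_boxes: bool) -> np.ndarray:
    """image: HxWx3 uint8 RGB image; gt_mask: HxW boolean mask of a single class."""
    predictor.set_image(image)
    prediction = np.zeros_like(gt_mask, dtype=bool)
    segments, num_segments = ndimage.label(gt_mask)  # connected segments of the class
    for seg_id in range(1, num_segments + 1):
        segment = segments == seg_id
        if use_boxes:
            ys, xs = np.nonzero(segment)
            box = np.array([xs.min(), ys.min(), xs.max(), ys.max()])  # XYXY oracle box
            masks, _, _ = predictor.predict(box=box, multimask_output=False)
        else:
            distance = ndimage.distance_transform_edt(segment)
            y, x = np.unravel_index(np.argmax(distance), distance.shape)  # oracle point
            masks, _, _ = predictor.predict(point_coords=np.array([[x, y]]),
                                            point_labels=np.array([1]),
                                            multimask_output=False)
        prediction |= masks[0]
    return prediction
```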

Domain results

| Model | General | Earth Monitoring | Medical Sciences | Engineering | Agriculture and Biology | Mean |
|---|---|---|---|---|---|---|
| Random¹ | 1.17 | 7.11 | 29.51 | 11.71 | 6.14 | 10.27 |
| Best supervised² | 49.15 | 79.12 | 89.49 | 67.66 | 81.94 | 71.13 |
| | | | | | | |
| SAM-B with oracle points | 50.41 | 38.72 | 43.7 | 45.16 | 57.84 | 46.59 |
| SAM-L with oracle points | 45.99 | 44.03 | 55.74 | 50.0 | 58.23 | 49.99 |
| SAM-H with oracle points | 36.05 | 34.82 | 59.58 | 47.35 | 39.91 | 43.0 |
| | | | | | | |
| SAM-B with oracle boxes | 78.5 | 73.56 | 68.14 | 73.29 | 86.0 | 75.67 |
| SAM-L with oracle boxes | 78.0 | 73.27 | 64.98 | 73.09 | 86.99 | 74.97 |
| SAM-H with oracle boxes | 65.23 | 59.61 | 66.58 | 66.4 | 78.63 | 66.55 |

Dataset results

| Model | BDD100K | Dark Zurich | MHP v1 | FoodSeg103 | ATLANTIS | DRAM | iSAID | ISPRS Potsdam | WorldFloods | FloodNet | UAVid | Kvasir-Instrument | CHASE DB1 | CryoNuSeg | PAXRay-4 | Corrosion CS | DeepCrack | PST900 | ZeroWaste-f | SUIM | CUB-200 | CWFID | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random¹ | 1.48 | 1.31 | 1.27 | 0.23 | 0.56 | 2.16 | 0.56 | 8.02 | 18.43 | 3.39 | 5.18 | 27.99 | 27.25 | 31.25 | 31.53 | 9.3 | 26.52 | 4.52 | 6.49 | 5.3 | 0.06 | 13.08 | 10.27 |
| Best supervised² | 44.8 | 63.9 | 50.0 | 45.1 | 42.22 | 45.71 | 65.3 | 87.56 | 92.71 | 82.22 | 67.8 | 93.7 | 97.05 | 73.45 | 93.77 | 49.92 | 85.9 | 82.3 | 52.5 | 74.0 | 84.6 | 87.23 | 70.99 |
| | | | | | | | | | | | | | | | | | | | | | | | |
| SAM-B with oracle points | 50.47 | 35.19 | 44.55 | 58.48 | 61.5 | 52.29 | 21.54 | 38.93 | 32.36 | 58.83 | 41.93 | 67.25 | 37.1 | 23.95 | 46.51 | 35.99 | 47.16 | 36.39 | 61.09 | 64.71 | 64.35 | 44.47 | 46.59 |
| SAM-L with oracle points | 38.09 | 38.46 | 49.51 | 46.85 | 53.68 | 49.34 | 45.03 | 41.58 | nan | 53.87 | 35.65 | 85.35 | 30.68 | 51.63 | 55.29 | 42.75 | 48.82 | 46.56 | 61.86 | 55.68 | 75.1 | 43.91 | 49.99 |
| SAM-H with oracle points | 27.64 | 37.11 | 53.76 | 31.02 | 35.69 | 31.1 | 52.65 | 29.34 | nan | 33.78 | 23.5 | 84.22 | 33.35 | 64.11 | 56.62 | 34.54 | 55.08 | 55.76 | 43.99 | 28.54 | 48.91 | 42.3 | 43.0 |
| | | | | | | | | | | | | | | | | | | | | | | | |
| SAM-B with oracle boxes | 72.66 | 68.67 | 82.47 | 86.37 | 81.64 | 79.2 | 75.53 | 68.65 | nan | 76.49 | 73.59 | 92.58 | 22.59 | 85.23 | 72.17 | 67.01 | 66.49 | 75.46 | 84.22 | 86.42 | 86.88 | 84.7 | 75.67 |
| SAM-L with oracle boxes | 70.58 | 67.1 | 81.94 | 85.44 | 81.36 | 81.6 | 75.0 | 68.5 | nan | 76.78 | 72.81 | 93.5 | 22.82 | 76.1 | 67.49 | 64.15 | 69.35 | 73.97 | 84.9 | 87.43 | 89.48 | 84.05 | 74.97 |
| SAM-H with oracle boxes | 57.93 | 59.96 | 76.22 | 60.2 | 68.82 | 68.26 | 73.72 | 52.04 | nan | 59.4 | 53.29 | 91.03 | 33.41 | 75.16 | 66.73 | 56.75 | 66.55 | 67.78 | 74.52 | 67.85 | 84.71 | 83.32 | 66.55 |

Model implementations

Links to the official implementations and the code adaptations for the MESS benchmark:

| Model | Official implementation | MESS adaptation |
|---|---|---|
| ZSSeg | https://github.com/MendelXu/zsseg.baseline | https://github.com/blumenstiel/zsseg.baseline-MESS |
| ZegFormer | https://github.com/dingjiansw101/ZegFormer | https://github.com/blumenstiel/ZegFormer-MESS |
| OVSeg | https://github.com/facebookresearch/ov-seg | https://github.com/blumenstiel/ov-seg-MESS |
| X-Decoder | https://github.com/microsoft/X-Decoder | https://github.com/blumenstiel/X-Decoder-MESS |
| SAN | https://github.com/MendelXu/SAN | https://github.com/blumenstiel/SAN-MESS |
| OpenSeeD | https://github.com/IDEA-Research/OpenSeeD | https://github.com/blumenstiel/OpenSeeD-MESS |
| CAT-Seg | https://github.com/KU-CVLAB/CAT-Seg | https://github.com/blumenstiel/CAT-Seg-MESS |
| Grounded-SAM | https://github.com/IDEA-Research/Grounded-Segment-Anything | https://github.com/blumenstiel/Grounded-SAM-MESS |
| SAM | https://github.com/facebookresearch/segment-anything | https://github.com/blumenstiel/SAM-MESS |

Feel free to add your results by contacting us via email.

¹ Random is a lower bound. The values represent the expected mIoU from predictions with a uniform class distribution (see the sketch below).

² Best supervised lists the results of recent supervised models for each dataset individually. We refer to our paper for details.
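
Regarding footnote 1, the expected mIoU of a uniformly random prediction can be approximated in closed form. The sketch below is our reading of that baseline under the assumption of i.i.d. uniform class predictions per pixel; it is not code from the benchmark toolkit, and the exact computation may differ.

```python
# Hedged sketch: expected mIoU (in %) when every pixel is predicted i.i.d. uniformly
# over the K dataset classes. With p_c the fraction of pixels labelled class c, the
# expected intersection is p_c / K and the expected union is p_c + 1/K - p_c/K,
# giving IoU_c ~ p_c / (K * p_c + 1 - p_c).
def expected_random_miou(class_frequencies: list) -> float:
    """class_frequencies: fraction of pixels per class over the dataset (sums to 1)."""
    k = len(class_frequencies)
    return 100 * sum(p / (k * p + 1 - p) for p in class_frequencies) / k

# Example: a binary dataset with ~93% background and ~7% foreground pixels gives an
# expected random mIoU of roughly 27%, in the range of the binary medical datasets above.
print(round(expected_random_miou([0.93, 0.07]), 2))  # ~27.36
```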