Chance-Adjusted Accuracy* scores on the SITE benchmark (8,068 examples).
| Model | Overall | Count | Loc | 3D Inf | MultiV | Rel | Mov |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Random | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Tiny Subset | | | | | | | |
| Human | 67.5 | 66.0 | 83.3 | 54.7 | 87.5 | 73.0 | 52.5 |
| InternVL-2.5-8B | 34.3 | 48.5 | 46.8 | 9.32 | 8.51 | 45.6 | 23.7 |
| GPT-4o | 35.6 | 42.4 | 51.2 | 11.0 | 17.8 | 42.7 | 19.5 |
| Open-source | | | | | | | |
| InternVL-2.5-8B | 32.8 | 47.1 | 37.0 | 23.2 | 9.05 | 47.6 | 28.7 |
| Qwen2.5-VL-7B | 31.4 | 52.6 | 44.1 | 9.42 | 1.08 | 51.5 | 18.9 |
| LLAVA-OV-7B | 30.2 | 51.8 | 38.5 | 22.4 | 9.40 | 55.3 | 9.18 |
| Qwen2.5-VL-3B | 29.5 | 45.6 | 37.5 | 13.2 | 7.14 | 45.6 | 18.8 |
| InternVL-2.5-4B | 29.4 | 47.9 | 32.9 | 11.4 | 3.94 | 47.2 | 22.9 |
| Phi-3.5-Vision | 21.8 | 33.2 | 34.0 | 11.7 | 3.33 | 32.8 | 11.7 |
| LLAVA-OV-0.5B | 18.4 | 28.0 | 32.3 | 5.67 | 3.77 | 30.6 | 4.70 |
| Proprietary | | | | | | | |
| GPT-4o | 37.8 | 44.6 | 56.0 | 26.9 | 22.0 | 54.6 | 18.4 |
| Gemini-1.5-Pro | 32.5 | 48.0 | 45.8 | 25.3 | 5.33 | 48.8 | 18.4 |
Chance-Adjusted Accuracy*: Subtract the chance level from the raw accuracy score and divide by (1 − chance level), so that 0 means "just chance" and 1 means "perfect". The table reports these scores on a 0 to 100 scale.
Spatial Intelligence Categories: Count = Counting and Existence; Loc = Localization and Positioning; 3D Inf = 3D Information Understanding;
MultiV = Multi-View and Cross-Image Reasoning; Rel = Spatial Relationship Reasoning; Mov = Movement Prediction and Navigation.
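
As a rough illustration of the chance-adjusted accuracy footnote, the sketch below assumes every example is a multiple-choice question whose chance level is 1 / (number of answer choices), and that the score is rescaled so a perfect run reaches 1. The function name `chance_adjusted_accuracy` and its arguments are hypothetical and are not taken from the benchmark's evaluation code, which may aggregate differently.

```python
from typing import Sequence


def chance_adjusted_accuracy(correct: Sequence[bool],
                             num_choices: Sequence[int]) -> float:
    """Hypothetical helper: accuracy rescaled so chance = 0 and perfect = 1.

    correct[i]     -- whether example i was answered correctly
    num_choices[i] -- number of answer options for example i (assumed >= 2)
    """
    if len(correct) != len(num_choices):
        raise ValueError("correct and num_choices must have the same length")
    if not correct:
        raise ValueError("at least one example is required")
    if any(k < 2 for k in num_choices):
        raise ValueError("each example needs at least two answer choices")

    n = len(correct)
    # Expected number of correct answers under uniform random guessing
    # (assumption: chance level per example is 1 / number of choices).
    expected_by_chance = sum(1.0 / k for k in num_choices)
    # Subtract chance, then normalize by the headroom above chance.
    return (sum(correct) - expected_by_chance) / (n - expected_by_chance)


if __name__ == "__main__":
    # Four 4-way questions with two answered correctly.
    score = chance_adjusted_accuracy([True, True, False, False], [4, 4, 4, 4])
    print(f"{100 * score:.1f}")  # multiply by 100 to match the table's scale
```

Under this definition a model that guesses uniformly at random scores about 0 in expectation, which matches the all-zero Random row, and a score can be negative when a model does worse than chance.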
🚨 To submit your results to the leaderboard, please send your result JSON files to this email.