Accuracy scores on the testmini subset (1,000 examples) of MathVista.
| # | Model | Method | Source | Date | ALL | FQA | GPS | MWP | TQA | VQA | ALG | ARI | GEO | LOG | NUM | SCI | STA |
|---|-------|--------|--------|------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| - | Human | - | Link | 2023-10-03 | 60.3 | 59.7 | 48.4 | 73.0 | 63.2 | 55.9 | 50.9 | 59.2 | 51.4 | 40.7 | 53.8 | 64.9 | 63.9 |
| 1 | GPT-4V (Playground) 🥇 | LMM 🖼️ | Link | 2023-10-15 | 49.9 | 43.1 | 50.5 | 57.5 | 65.2 | 38.0 | 53.0 | 49.0 | 51.0 | 21.6 | 20.1 | 63.1 | 55.8 |
| 2 | SPHINX (V2) 🥈 | LMM 🖼️ | Link | 2023-11-17 | 36.7 | 54.7 | 16.4 | 23.1 | 41.8 | 43.0 | 20.6 | 33.4 | 17.6 | 24.3 | 21.5 | 43.4 | 51.5 |
| 3 | Multimodal Bard 🥉 | LMM 🖼️ | Link | 2023-10-03 | 34.8 | 26.0 | 47.1 | 29.6 | 48.7 | 26.8 | 46.5 | 28.6 | 47.8 | 13.5 | 14.9 | 47.5 | 33.0 |
| 4 | PoT GPT-4 (Caption+OCR) | Tool 🛠️ | Link | 2023-10-03 | 33.9 | 30.1 | 39.4 | 30.6 | 39.9 | 31.3 | 37.4 | 31.7 | 41.0 | 18.9 | 20.1 | 44.3 | 37.9 |
| 5 | CoT GPT-4 (Caption+OCR) | Tool 🛠️ | Link | 2023-10-03 | 33.2 | 27.9 | 31.7 | 31.2 | 51.9 | 28.5 | 33.5 | 30.9 | 32.2 | 13.5 | 12.5 | 58.2 | 37.9 |
| 6 | CoT ChatGPT (Caption+OCR) | Tool 🛠️ | Link | 2023-10-03 | 33.2 | 27.5 | 29.3 | 36.0 | 49.4 | 29.1 | 31.0 | 32.9 | 31.0 | 16.2 | 17.4 | 50.8 | 37.2 |
| 7 | CoT Claude-2 (Caption+OCR) | Tool 🛠️ | Link | 2023-10-03 | 33.2 | 26.0 | 31.7 | 35.5 | 48.1 | 30.2 | 32.4 | 32.3 | 33.0 | 16.2 | 17.4 | 54.9 | 36.2 |
| 8 | SPHINX (V1) | LMM 🖼️ | Link | 2023-11-09 | 27.5 | 23.4 | 23.1 | 21.5 | 39.9 | 34.1 | 25.6 | 28.1 | 23.4 | 16.2 | 17.4 | 40.2 | 23.6 |
| 9 | PoT ChatGPT (Caption+OCR) | Tool 🛠️ | Link | 2023-10-03 | 26.8 | 24.5 | 26.4 | 23.7 | 33.5 | 27.9 | 27.8 | 26.1 | 28.0 | 18.9 | 13.2 | 33.6 | 29.9 |
| 10 | LLaVA (LLaMA-2-13B) | LMM 🖼️ | Link | 2023-10-03 | 26.1 | 26.8 | 29.3 | 16.1 | 32.3 | 26.3 | 27.3 | 20.1 | 28.8 | 24.3 | 18.3 | 37.3 | 25.1 |
| 11 | InstructBLIP (Vicuna-7B) | LMM 🖼️ | Link | 2023-10-03 | 25.3 | 23.1 | 20.7 | 18.3 | 32.3 | 35.2 | 21.8 | 27.1 | 20.7 | 18.9 | 20.4 | 33.0 | 23.1 |
| 12 | LLaVAR | LMM 🖼️ | Link | 2023-10-03 | 25.2 | 21.9 | 25.0 | 16.7 | 34.8 | 30.7 | 24.2 | 22.1 | 23.0 | 13.5 | 15.3 | 42.6 | 21.9 |
| 13 | LLaMA-Adapter-V2 (7B) | LMM 🖼️ | Link | 2023-10-03 | 23.9 | 21.2 | 25.5 | 11.3 | 32.3 | 31.8 | 26.3 | 20.4 | 24.3 | 24.3 | 13.9 | 29.5 | 18.3 |
| 14 | miniGPT4 (LLaMA-2-7B) | LMM 🖼️ | Link | 2023-10-03 | 23.1 | 18.6 | 26.0 | 13.4 | 30.4 | 30.2 | 28.1 | 21.0 | 24.7 | 16.2 | 16.7 | 25.4 | 17.9 |
| 15 | mPLUG-Owl (LLaMA-7B) | LMM 🖼️ | Link | 2023-10-03 | 22.2 | 22.7 | 23.6 | 10.2 | 27.2 | 27.9 | 23.6 | 19.2 | 23.9 | 13.5 | 12.7 | 26.3 | 21.4 |
| 16 | IDEFICS (9B-Instruct) | LMM 🖼️ | Link | 2023-10-03 | 19.8 | 21.6 | 21.1 | 6.5 | 25.9 | 24.0 | 22.1 | 15.0 | 19.8 | 18.9 | 9.9 | 24.6 | 18.1 |
| 17 | Random Chance | - | Link | 2023-10-03 | 17.9 | 18.2 | 21.6 | 3.8 | 19.6 | 26.3 | 21.7 | 14.7 | 20.1 | 13.5 | 8.3 | 17.2 | 16.3 |
Task types: FQA: figure QA, GPS: geometry problem solving, MWP: math word problem, TQA: textbook QA, VQA: visual QA.
Math reasoning types: ALG: algebraic, ARI: arithmetic, GEO: geometry, LOG: logical, NUM: numeric, SCI: scientific, STA: statistical.
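Each column above is an accuracy: ALL is overall accuracy on the 1,000 testmini examples, and each task or reasoning column is accuracy restricted to the examples tagged with that type (one example can count toward several reasoning types at once). Below is a minimal sketch of this aggregation in Python; the `correct`, `task`, and `skills` keys are illustrative assumptions, not MathVista's official evaluation code:

```python
from collections import defaultdict

def accuracy(scores):
    """Mean of 0/1 correctness values, as a percentage."""
    return 100.0 * sum(scores) / len(scores) if scores else 0.0

def leaderboard_scores(examples):
    """Aggregate per-example correctness into the leaderboard columns.

    Each element of `examples` is assumed to carry:
      - "correct": bool, whether the extracted answer matched the gold answer
      - "task":    one of FQA, GPS, MWP, TQA, VQA
      - "skills":  a list drawn from ALG, ARI, GEO, LOG, NUM, SCI, STA
    (These keys are illustrative, not MathVista's official schema.)
    """
    buckets = defaultdict(list)
    for ex in examples:
        buckets["ALL"].append(ex["correct"])
        buckets[ex["task"]].append(ex["correct"])
        for skill in ex["skills"]:
            buckets[skill].append(ex["correct"])
    return {col: round(accuracy(vals), 1) for col, vals in buckets.items()}

# Toy check: two examples give ALL = 50.0, GPS = 100.0, MWP = 0.0.
print(leaderboard_scores([
    {"correct": True,  "task": "GPS", "skills": ["ALG", "GEO"]},
    {"correct": False, "task": "MWP", "skills": ["ARI"]},
]))
```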
🚨 To submit your results to the leaderboard, please send your result JSON file to this mail.
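The exact contents of the result JSON file are defined by the organizers, so treat the following only as an illustration: a hypothetical per-example layout keyed by problem id, where the `response` and `extracted_answer` fields are assumptions rather than the official schema.

```python
import json

# Hypothetical result layout: one entry per testmini example, keyed by
# problem id. The keys below are illustrative; follow the organizers'
# specification for the actual submission format.
results = {
    "1": {"response": "The perimeter is 8 cm.", "extracted_answer": "8"},
    "2": {"response": "x = 3.5", "extracted_answer": "3.5"},
}

with open("results_testmini.json", "w") as f:
    json.dump(results, f, indent=2)
```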