Accuracy scores on the testmini subset (1,000 examples) of MathVista.
Accuracy scores of one leading LLM (i.e., PoT GPT-4), four primary LMMs, random chance, and human performance on our proposed MathVista across mathematical reasoning and visual context types. PoT refers to program-of-thought prompting, and PoT GPT-4 is a textual LLM augmented with the caption and OCR text. GPT-4V is manually evaluated via the playground chatbot. The scores of Gemini Ultra are from the Gemini Team, Google.
Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit impressive problem-solving skills in many tasks and domains, but their ability in mathematical reasoning in visual contexts has not been systematically studied.
To bridge this gap, we present MathVista, a benchmark designed to combine challenges from diverse mathematical and visual tasks. It consists of 6,141 examples, derived from 28 existing multimodal datasets involving mathematics and 3 newly created datasets (i.e., IQTest, FunctionQA, and PaperQA). Completing these tasks requires fine-grained, deep visual understanding and compositional reasoning, which all state-of-the-art foundation models find challenging.
With MathVista, we have conducted a comprehensive, quantitative evaluation of 12 prominent foundation models. The best-performing GPT-4V model achieves an overall accuracy of 49.9%, substantially outperforming Bard, the second-best performer, by 15.1%. Our in-depth analysis reveals that the superiority of GPT-4V is mainly attributed to its enhanced visual perception and mathematical reasoning. However, GPT-4V still falls short of human performance by 10.4%, as it often struggles to understand complex figures and perform rigorous reasoning. This significant gap underscores the critical role that MathVista will play in the development of general-purpose AI agents capable of tackling mathematically intensive and visually rich real-world tasks. We further explore the new ability of self-verification, the use of self-consistency, and goal-directed multi-turn human-AI dialogues, highlighting the promising potential of GPT-4V for future research.
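For readers who want to see what the self-consistency exploration amounts to in practice, the snippet below is a minimal sketch of majority voting over sampled responses. The `query_model` function is a hypothetical stand-in for any chat-completion call that returns a short final answer; it is not part of the MathVista codebase, and the sampling parameters are illustrative.

```python
from collections import Counter

def self_consistency_answer(query_model, prompt, num_samples=5, temperature=0.7):
    """Sample several reasoning paths and return the majority-vote final answer."""
    # Each call is assumed to return the model's extracted short answer as a string.
    answers = [query_model(prompt, temperature=temperature) for _ in range(num_samples)]
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer
```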
Accuracy scores on the testmini subset (1,000 examples) of MathVista.
Accuracy scores on the test subset (5,141 examples with private ground truth) of MathVista.
# | Model | Method | Source | Date | ALL | FQA | GPS | MWP | TQA | VQA | ALG | ARI | GEO | LOG | NUM | SCI | STA |
1 | InternVL2-Pro 🥇 | LMM 🖼️ | Link | 2024-09-04 | 65.84 | 65.0 | 64.0 | 75.4 | 72.4 | 52.3 | 67.4 | 63.1 | 65.0 | 30.4 | 44.5 | 67.0 | 72.5 |
2 | InternVL2-8B-MPO 🥈 | LMM 🖼️ | Link | 2024-11-14 | 65.65 | 67.4 | 68.0 | 73.0 | 65.3 | 51.5 | 66.7 | 60.6 | 68.5 | 19.1 | 43.1 | 64.9 | 77.1 |
3 | InternVL-Chat-V1.2-Plus 🥉 | LMM 🖼️ | Link | 2024-02-22 | 60.18 | 52.2 | 56.2 | 78.3 | 61.6 | 55.5 | 56.0 | 64.4 | 57.6 | 21.6 | 46.1 | 60.0 | 60.1 |
4 | InternLM-XComposer2-VL-7B | LMM 🖼️ | Link | 2024-01-22 | 57.93 | 53.9 | 56.4 | 77.1 | 58.4 | 43.2 | 54.8 | 57.6 | 58.0 | 16.5 | 47.6 | 59.1 | 62.5 |
5 | Qwen-VL-Plus | LMM 🖼️ | Link | 2023-12-26 | 44.33 | 55.9 | 34.7 | 29.7 | 58.8 | 42.4 | 40.7 | 35.4 | 36.6 | 21.6 | 30.4 | 55.9 | 56.3 |
6 | SPHINX-MoE | MoE 🤖 | Link | 2024-01-13 | 42.68 | 50.3 | 29.7 | 40.9 | 49.3 | 43.3 | 33.9 | 43.0 | 29.1 | 14.4 | 26.3 | 46.9 | 51.2 |
7 | MiniCPM-V-2 (2.8B) | LMM 🖼️ | Link | 2024-04-14 | 39.89 | 51.7 | 27.4 | 39.8 | 42.5 | 34.7 | 31.3 | 34.4 | 30.7 | 13.4 | 33.5 | 38.5 | 50.0 |
8 | PoT GPT-4 (Caption+OCR) | Tool 🛠️ | Link | 2023-10-03 | 31.74 | 27.6 | 37.4 | 23.9 | 43.0 | 30.3 | 37.1 | 27.9 | 37.5 | 22.7 | 15.8 | 44.5 | 31.9 |
9 | CoT GPT-4 (Caption+OCR) | Tool 🛠️ | Link | 2023-10-03 | 30.50 | 27.2 | 35.9 | 21.3 | 43.1 | 28.2 | 35.7 | 25.2 | 35.8 | 24.7 | 15.4 | 47.3 | 31.3 |
10 | LLaVA (LLaMA-2-13B) | LMM 🖼️ | Link | 2023-10-03 | 25.40 | 22.9 | 24.6 | 18.1 | 35.8 | 29.7 | 26.9 | 22.5 | 24.4 | 19.1 | 19.1 | 34.7 | 21.6 |
* | Random Chance | - | Link | 2023-10-03 | 17.86 | 15.5 | 24.1 | 4.5 | 23.4 | 24.3 | 25.8 | 13.8 | 22.7 | 13.4 | 8.8 | 15.8 | 14.3 |
🚨 To submit your results to the leaderboard, please send your result JSON files to this email.
🚨 For more submission details, please refer to this link and this link.
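For illustration only, the snippet below shows one way to package predictions as a JSON file before emailing them. The field layout here is a hypothetical assumption; the authoritative schema is described in the submission guide linked above.

```python
import json

# Hypothetical layout: map each problem id to the model's extracted final answer.
# Follow the linked submission guide for the actual required schema.
predictions = {"1": "8", "2": "(B)"}

with open("mathvista_test_results.json", "w") as f:
    json.dump(predictions, f, indent=2)
```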
MathVista is a consolidated Mathematical reasoning benchmark within Visual contexts. It consists of three newly created datasets, IQTest, FunctionQA, and PaperQA, which address the missing visual domains and are tailored to evaluate logical reasoning on puzzle test figures, algebraic reasoning over functional plots, and scientific reasoning with academic paper figures, respectively. It also incorporates 9 MathQA datasets and 19 VQA datasets from the literature, which significantly enrich the diversity and complexity of visual perception and mathematical reasoning challenges within our benchmark. In total, MathVista includes 6,141 examples collected from 31 different datasets.
Examples of our newly annotated datasets: IQTest, FunctionQA, and PaperQA.
Summary of the 31 different source datasets in MathVista.
All the data examples were divided into two subsets: testmini and test.
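As a quick way to browse the benchmark, the sketch below loads it with the Hugging Face `datasets` library. It assumes the data is hosted on the Hub under an identifier such as `AI4Math/MathVista` with `testmini` and `test` splits; check the dataset card for the exact name and available fields.

```python
from datasets import load_dataset

# Load the public evaluation split; the test split ships without ground-truth answers.
testmini = load_dataset("AI4Math/MathVista", split="testmini")

print(len(testmini))       # expected: 1,000 examples
print(testmini[0].keys())  # inspect the available fields before building an eval loop
```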
Key statistics of MathVista.
Source dataset distribution of MathVista. FQA: figure question answering, GPS: geometry problem solving, MWP: math word problem, TQA: textbook question answering, VQA: visual question answering.
One example for each mathematical reasoning skill required in MathVista
Arithmetic Reasoning
Algebraic Reasoning
Geometric Reasoning
Logical Reasoning
Numeric Reasoning
Statistical Reasoning
Scientific Reasoning
One example for each visual context type required in MathVista
Geometry Diagram
Synthetic Scene
Bar Chart
Natural Image
Scientific Figure
Table
Function Plot
Abstract Scene
Puzzle Test
Scatter Plot
Line Plot
Pie Chart
Document Image
Medical Image
Others
Notable statistics of MathVista
Distribution of visual context types within MathVista
Category distribution of problems within MathVista
Distribution of questions across different grade levels within MathVista
Distribution of the number of words per question in MathVista.
Portion of each mathematical reasoning type involved in the problems of MathVista
Distribution of the number of mathematical reasoning types within MathVista
Task type distribution of problems within MathVista
Accuracy scores of primary baselines on the testmini subset (1,000 examples) of MathVista.
Both CoT GPT-4 and PoT GPT-4 are augmented with Bard captions and OCR text.
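To make that augmentation concrete, the sketch below shows one way a text-only LLM could be prompted with an externally generated caption and OCR text, as in the CoT and PoT GPT-4 baselines. The function name and prompt wording are illustrative assumptions, not the exact templates used in the paper.

```python
def build_augmented_prompt(question, caption, ocr_text, program_of_thought=True):
    """Compose a prompt that exposes the visual context to a text-only model."""
    # Caption and OCR text are assumed to come from external tools (e.g., a
    # captioner such as Bard and an OCR system), since the LLM cannot see the image.
    instruction = (
        "Write a Python program that computes the answer."
        if program_of_thought
        else "Reason step by step, then state the final answer."
    )
    return (
        f"Image caption: {caption}\n"
        f"Text detected in the image (OCR): {ocr_text}\n"
        f"Question: {question}\n"
        f"{instruction}"
    )
```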
Error analysis of Bard results: (a) presents errors in answers and explanations; (b) delves into the details of wrong explanations. Notations: “Answer” is “Ans.”, “Explanation” is “Exp.”, “Partially Correct” is “Partial”, and “Not applicable” refers to unanswerable or indeterminate cases.
Average accuracy scores across different grade levels for leading foundation models
Accuracy scores of leading baselines across various visual contexts
Average accuracy scores of LLM baselines under various visual inputs
Explore the outputs of each model on MathVista
@inproceedings{lu2024mathvista,
author = {Lu, Pan and Bansal, Hritik and Xia, Tony and Liu, Jiacheng and Li, Chunyuan and Hajishirzi, Hannaneh and Cheng, Hao and Chang, Kai-Wei and Galley, Michel and Gao, Jianfeng},
title = {MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts},
booktitle = {International Conference on Learning Representations (ICLR)},
year = {2024}
}