FinMR: A Knowledge-Intensive Multimodal Benchmark for Advanced Financial Reasoning

Paper Code 🤗 Dataset 🏆 Leaderboard 🔍 Visualization
Introduction

Multimodal Large Language Models (MLLMs) have made substantial progress in recent years. However, their rigorous evaluation within specialized domains like finance is hindered by the absence of datasets characterized by professional-level knowledge intensity, detailed annotations, and advanced reasoning complexity. To address this critical gap, we introduce FinMR, a high-quality, knowledge-intensive multimodal dataset explicitly designed to evaluate expert-level financial reasoning capabilities at a professional analyst's standard. FinMR comprises over 3,200 meticulously curated and expertly annotated question-answer pairs across 15 diverse financial topics, ensuring broad domain diversity and integrating sophisticated mathematical reasoning, advanced financial knowledge, and nuanced visual interpretation tasks across multiple image types. Through comprehensive benchmarking with leading closed-source and open-source MLLMs, we highlight significant performance disparities between these models and professional financial analysts, uncovering key areas for model advancement, such as precise image analysis, accurate application of complex financial formulas, and deeper contextual financial understanding. By providing richly varied visual content and thorough explanatory annotations, FinMR establishes itself as an essential benchmark for assessing and advancing multimodal financial reasoning toward professional analyst-level competence.

Overview
Our Dataset
Evaluation Process

Four stages of the evaluation process

Evaluation Results of Large Multimodal Models (LMMs)

FinMR provides diverse visual data, as shown in panel (a). The evaluation of financial reasoning abilities of LLMs and MLLMs covers both mathematical and expertise-based tasks (panel (c)), and performance varies across the 15 financial topics (panel (d); the abbreviation list of topics is provided in the leaderboard). The key error categories shown in panel (b) include image recognition failures, incorrect formula application, and question misunderstanding.
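
To start exploring the question-answer pairs, the sketch below loads FinMR from the Hugging Face Hub and inspects one sample. The repository ID and field names here are illustrative assumptions; consult the dataset card for the actual schema.

```python
from datasets import load_dataset

# Hypothetical repository ID -- replace with the actual FinMR dataset ID on the Hub.
ds = load_dataset("FinMR/FinMR", split="test")

# Field names below are assumptions about the schema, not the confirmed column names.
sample = ds[0]
print(sample["question"])     # question text referencing one or more images
print(sample["topic"])        # one of the 15 financial topics
print(sample["answer"])       # expert-annotated reference answer
print(sample["explanation"])  # step-by-step explanatory annotation
```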

Leaderboard

Accuracy scores of the evaluated models on the FinMR dataset
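
Leaderboard entries report answer accuracy. Below is a minimal sketch of an exact-match accuracy computation; the official evaluation may apply additional answer extraction or normalization rules that this sketch does not attempt to reproduce.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference answer (case-insensitive)."""
    assert len(predictions) == len(references)
    matches = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return matches / len(references)

# Toy example: one of two predictions matches its reference answer.
print(exact_match_accuracy(["$1.2 million", "increase"],
                           ["$1.2 million", "decrease"]))  # 0.5
```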