Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation

Source: arXiv AI Papers

Chart-to-code generation requires complex reasoning over visual charts to produce structured code, a task where traditional supervised fine-tuning (SFT) methods often plateau despite scaling data. To address this, the authors constructed the largest training corpus to date with 3 million real-world chart-code pairs from arXiv tables, moving beyond simplistic synthetic datasets. Their experiments reveal that simply increasing SFT data eventually yields diminishing returns, highlighting the need for more effective training strategies.

The proposed Multimodal Structured Reinforcement Learning (MSRL) method introduces a multi-granularity reward system that integrates both textual and visual feedback. Textual rewards use rule-based validations to ensure fine-grained code accuracy, while visual rewards assess structural similarity by rendering generated code into images and evaluating them with a model-based approach. This dual feedback mechanism is implemented within a two-stage curriculum to maintain training stability.
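The combined reward described above can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' implementation: the function names, the specific rule checks, and the 50/50 weighting are all assumptions, and the visual score is passed in as a placeholder for the paper's render-and-judge model.

```python
import ast

def textual_reward(code: str, required_calls: set[str]) -> float:
    """Rule-based textual reward (illustrative stand-in for the paper's
    fine-grained code checks): 0 if the code does not parse, otherwise
    the fraction of required plotting calls that appear in the code."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return 0.0
    called = {
        node.func.attr
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)
    }
    return len(required_calls & called) / len(required_calls)

def msrl_reward(code: str, required_calls: set[str],
                visual_score: float, w_text: float = 0.5) -> float:
    """Combine textual and visual feedback. In the paper, visual_score
    would come from rendering the generated code to an image and scoring
    structural similarity with a model-based judge; here it is an input."""
    return w_text * textual_reward(code, required_calls) \
        + (1.0 - w_text) * visual_score

# Hypothetical rollout: a generated snippet scored against two required calls
sample = (
    "import matplotlib.pyplot as plt\n"
    "fig, ax = plt.subplots()\n"
    "ax.bar([1, 2], [3, 4])\n"
    "ax.set_xlabel('year')\n"
)
print(msrl_reward(sample, {"bar", "set_xlabel"}, visual_score=0.8))
```

Decomposing the reward this way lets a syntactically broken program receive zero textual credit immediately, while the visual term only discriminates among programs that actually render, which is one plausible reading of why the paper pairs the two signals inside a two-stage curriculum.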

MSRL demonstrates significant improvements, breaking the SFT plateau with 6.2% and 9.9% gains on the ChartMimic and ReachQA benchmarks, respectively. These results indicate that structured reinforcement learning with multimodal feedback can effectively enhance chart-to-code generation beyond what is achievable with SFT alone. The approach also achieves competitive performance compared to advanced closed-source models, suggesting its potential for broader adoption in tasks requiring structured output generation from complex visual inputs. Future work may explore further refinement of reward mechanisms and application to other multimodal reasoning tasks.

👉 Read the original: arXiv AI Papers