Benchmarking LLMs on complex data structures: Who really understands Tomato model-based testing?

Recently, we put four very different LLMs through a grueling testing gauntlet to see if they can move beyond simple chat and handle complex, structured data modeling.

The benchmark tested their ability to generate logic nodes, handle deep structural nesting and flattening in YAML, inject localized mock data, and parse multi-modal inputs (images and PDFs) directly into parameter attributes.
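
To make "structural nesting and flattening" concrete, here is a minimal sketch of the kind of transformation involved. It assumes a model is a nested mapping once the YAML has been parsed. The parameter names and the dot-separated key convention are hypothetical illustrations, not the actual Tomato model format.

```python
def flatten(model, parent="", sep="."):
    """Collapse a nested mapping into one level, joining key paths with `sep`."""
    flat = {}
    for key, value in model.items():
        path = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict) and value:
            # Non-empty mapping: descend and merge its flattened entries.
            flat.update(flatten(value, path, sep))
        else:
            # Leaf value (or empty mapping): keep it under the full path.
            flat[path] = value
    return flat


def unflatten(flat, sep="."):
    """Rebuild the nested mapping from path-style keys."""
    nested = {}
    for path, value in flat.items():
        *parents, leaf = path.split(sep)
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


# Hypothetical parameter model, standing in for parsed YAML.
model = {
    "address": {
        "country": ["DE", "FR"],
        "city": {"name": ["Berlin", "Paris"]},
    },
    "age": [18, 65],
}

flat = flatten(model)
restored = unflatten(flat)
```

A round trip through both functions should return the original structure, which is roughly what a model has to get right when asked to flatten and then re-structure a hierarchy. The sketch assumes key names never contain the separator.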

Here is the breakdown of how they performed:

OpenAI

GPT-4.1
The industry-standard, cloud-based baseline.
Pros: It is a highly consistent, fast baseline that easily outperforms others in bulk generation. It handles multi-modal tasks, achieving perfect scores in image parsing, PDF text extraction, and XML model recreation. Additionally, it perfectly executes structural manipulation (effortlessly flattening or structuring hierarchical models) and demonstrates flawless debugging and general knowledge capabilities.
Cons: It stumbles on localization by hallucinating foreign addresses. It also requires heavy iterative prompting to build highly complex combinatorial YAML models from scratch. When simplifying logic, its outputs technically work but sometimes remain unnecessarily convoluted. Finally, its performance drops significantly when creating logic nodes within nested structures, occasionally producing multiple nodes of the same type.
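
The duplicate-node failure is easy to check for mechanically. The sketch below is one possible way to do it, not the benchmark's actual scoring code. It assumes each node is a mapping with a "name", an optional "logic" list whose entries carry a "type", and an optional "children" list. All of these field names are hypothetical.

```python
from collections import Counter


def duplicate_logic_types(node, path=""):
    """Map each node path to the logic types that appear on it more than once."""
    here = f"{path}/{node['name']}"
    counts = Counter(entry["type"] for entry in node.get("logic", []))
    repeated = sorted(t for t, n in counts.items() if n > 1)
    report = {here: repeated} if repeated else {}
    # Walk the nested structure so deeply nested nodes are checked too.
    for child in node.get("children", []):
        report.update(duplicate_logic_types(child, here))
    return report


# Hypothetical generated model: "payment" carries the same logic type twice.
model = {
    "name": "order",
    "children": [
        {
            "name": "payment",
            "logic": [
                {"type": "if", "expr": "total > 0"},
                {"type": "if", "expr": "total > 0"},
            ],
            "children": [
                {"name": "card", "logic": [{"type": "if", "expr": "method == 'card'"}]},
            ],
        },
    ],
}

report = duplicate_logic_types(model)
print(report)  # {'/order/payment': ['if']}
```

An empty report means no node repeats a logic type. Catching redundant nodes of different types would need a semantic comparison of the expressions, which this sketch does not attempt.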

Google

Gemini 3 Flash Preview
A highly capable model accessible via its generous free tier.
Pros: Gemini is a powerhouse for deep logic and structure: it was the only model to build the advanced combinatorial geopolitics model perfectly on its very first try. It handles multi-modal tasks, acing image parsing as well as PDF and XML data extraction. Furthermore, it easily masters structural manipulation and logic simplification, while demonstrating perfect debugging and general knowledge capabilities.
Cons: The model is painfully slow, trailing well behind every other model in raw performance and bulk generation tasks. Furthermore, it can fail localization tests completely. Additionally, while its logic is robust, it occasionally creates redundant nodes of different types and shows minor inconsistencies when generating complex nested logic.

Anthropic

Claude Sonnet 4.6
The major GPT alternative, renowned for its reasoning and coding prowess.
Pros: Claude handles logic simplifications incredibly well, perfectly streamlining complex nodes without altering their meaning. It matches the baseline in structural manipulation, successfully flattening and structuring hierarchical models, and aces complex PDF/XML data extraction. Additionally, it demonstrates exceptional debugging skills by perfectly identifying and correcting injected errors, and performs excellently in general knowledge generation tasks.
Cons: Like the baseline, it struggles to build the advanced combinatorial geopolitical model from scratch, requiring heavy iterative prompting to get the structure right. It stumbles on localization by hallucinating foreign addresses. Furthermore, its accuracy drops noticeably when generating nested logic nodes, occasionally creating redundant nodes of different types.

Ollama

Ministral 3:8b
A lightweight, on-premise option for companies with strict data privacy.
Pros: Ministral, served through Ollama, is incredibly fast and operates entirely locally. Remarkably, despite being the smallest model, it was the only one to successfully nail the complex foreign localization address test. It performs quite well when generating name suggestions, utilizes general knowledge effectively, and is capable of extracting data from pure text PDFs.
Cons: It struggles heavily with structured output and strict formatting rules. It frequently fails to maintain valid YAML syntax during complex nesting, advanced model generation, and multi-modal tasks. Consequently, it completely fails at structural manipulation, proving unable to flatten or structure hierarchical models. It also scores zero when tasked with identifying and correcting injected model errors.

Generating text is a solved problem, but generating flawless, nested, logically constrained code structures is still a fierce battleground.