Croissant Tasks: Machine-Actionable Metadata for Reproducible ML Evaluations

TL;DR — Reproducibility is still a critical challenge in machine learning: source code is withheld, execution details are underspecified, and software environments rot. Croissant Tasks is a declarative, machine-actionable metadata format that represents benchmarks and competitions as structured data, formally decoupling the task problem from the task solution. With it, autonomous agents can ingest a specification and generate a working reproduction pipeline from scratch, matching published baselines without ever seeing the original code. The work is a collaboration across Google DeepMind, ChaLearn, Université Paris-Saclay, Jetty and Mila, Helmholtz Munich and the Technical University of Munich, Eindhoven University of Technology, and Brickroad, and it builds on the MLCommons Croissant dataset standard.

Why benchmarks are hard to reproduce

Benchmarks and competitions are the primary yardstick the ML community uses to measure progress. Yet reproducing them is notoriously painful. The reasons are structural rather than accidental: source code is often withheld, critical execution details such as hyperparameters, preprocessing, or evaluation prompts are underspecified, and implementations are tied to specific, rapidly obsolete software environments. A substantial gap exists between a paper's high-level conceptual contribution and the brittle, low-level details required to execute it.

The community has reached for many remedies — reproducibility checklists, model cards, badges, and asset-sharing platforms like Hugging Face, OpenML, and Codabench. These all help with documentation and sharing, but none provide a formal, standardized representation of the execution flow itself. Reproduction therefore stays dependent on ad-hoc setups rather than automated execution.

From technical replication to conceptual reproducibility

Recent autonomous coding agents enable a paradigm shift. Instead of demanding strict access to the original source code and its transient environment, we can describe an evaluation at a high level and let a modern agent synthesize a fresh implementation. This changes what counts as a successful reproduction:

Technical replication — run the exact original code in an identical environment to recover bit-wise identical numbers. High dependency on framework, OS, and hardware; brittle under minor shifts.
Conceptual reproducibility — verify the scientific claim through an independent implementation derived from a specification. Platform-agnostic, and arguably a more rigorous test of the underlying claim.

Formally, technical replication verifies a single path, while conceptual reproducibility asserts that any valid instantiation of the abstract problem and solution leads to the same conclusion.
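To make the distinction concrete, here is a minimal Python sketch. It assumes that "the same conclusion" can be reduced to a single metric agreeing with the published value within a tolerance; the function names, the tolerance, and the numbers are all hypothetical and are not part of Croissant Tasks.

```python
from typing import Iterable


def technically_replicated(published: float, rerun: float) -> bool:
    """Single path: the original code, rerun, must give the identical number."""
    return rerun == published


def conceptually_reproduced(
    published: float, independent_runs: Iterable[float], tolerance: float
) -> bool:
    """Every independent implementation must land within `tolerance` of the
    published value (an assumed stand-in for 'reaches the same conclusion')."""
    runs = list(independent_runs)
    if not runs:
        raise ValueError("need at least one independent run")
    return all(abs(value - published) <= tolerance for value in runs)


# Hypothetical accuracies from three independently written implementations.
runs = [69.6, 70.3, 70.1]
print(conceptually_reproduced(70.0, runs, tolerance=0.5))
print(conceptually_reproduced(70.0, runs, tolerance=0.2))
```

The tolerance is the interesting design choice: too tight and the check collapses back into technical replication, too loose and it stops testing the claim.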

The Croissant Tasks vocabulary

Croissant Tasks is built on Semantic Web technologies. It extends schema.org and integrates with the Croissant Datasets vocabulary to describe inputs and outputs, reusing the cr prefix. The central class is cr:Task, a unit of work with a handful of core properties:

Property	Description
`cr:input`	The data consumed by the task.
`cr:output`	The data produced by the task.
`cr:implementation`	The model, system, or code that performs the task.
`cr:execution`	Computing environment, resources, and dependencies.
`cr:evaluation`	Computed metrics and their values.
`cr:subTask`	Subtasks that are part of the definition.

Here is a self-contained cr:Task describing a single MMLU evaluation run, including its result:

{
  "@context": {
    "ex": "http://example.org/",
    "cr": "http://mlcommons.org/croissant/",
    "sc": "https://schema.org/"
  },
  "@type": "cr:Task",
  "@id": "ex:mmlu_small_fewshot",
  "sc:name": "MMLU Task - Small Model (Few-shot)",
  "cr:input": {
    "@type": "sc:Dataset",
    "@id": "https://huggingface.co/datasets/cais/mmlu",
    "sc:name": "MMLU Dataset on Hugging Face"
  },
  "cr:implementation": {
    "@type": "sc:SoftwareApplication",
    "sc:name": "OpenAI GPT API - Small"
  },
  "cr:evaluation": {
    "@type": "cr:EvaluationTask",
    "cr:evaluationResults": [
      {
        "@type": "cr:EvaluationResult",
        "cr:metric": "Accuracy",
        "cr:value": "25.9",
        "sc:description": "Overall Average Accuracy"
      }
    ]
  }
}
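Because the description is plain JSON-LD, a tool can read it with nothing more than a JSON parser. The sketch below uses only the Python standard library and a trimmed copy of the task above. It treats the document as ordinary JSON and does not expand the `@context`, which a real JSON-LD processor would do; `read_results` is a hypothetical helper, not part of any Croissant library.

```python
import json

# Trimmed copy of the cr:Task shown above (context and names omitted for brevity).
TASK_JSONLD = """
{
  "@type": "cr:Task",
  "@id": "ex:mmlu_small_fewshot",
  "cr:input": {
    "@type": "sc:Dataset",
    "@id": "https://huggingface.co/datasets/cais/mmlu"
  },
  "cr:evaluation": {
    "@type": "cr:EvaluationTask",
    "cr:evaluationResults": [
      {"@type": "cr:EvaluationResult", "cr:metric": "Accuracy", "cr:value": "25.9"}
    ]
  }
}
"""


def read_results(task: dict) -> dict[str, float]:
    """Map each reported metric name to its numeric value."""
    results = task.get("cr:evaluation", {}).get("cr:evaluationResults", [])
    # Values are stored as strings in the example, so convert them explicitly.
    return {r["cr:metric"]: float(r["cr:value"]) for r in results}


task = json.loads(TASK_JSONLD)
metrics = read_results(task)
print(task["@id"], "reads", task["cr:input"]["@id"])
print(metrics)
```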

Decoupling problems from solutions

The key design move is to separate the problem specification from the submitted solution — mirroring the social structure of science, where one group defines a challenge and others propose independent solutions over time. Croissant Tasks introduces two subclasses of cr:Task:

cr:TaskProblem — a problem definition. It mixes concrete "givens" with placeholders (Specs) for the components a solution must fulfill. A cr:OutputSpec can pin a target output schema; a cr:EvaluationSpec can define which metrics to compute.
cr:TaskSolution — a concrete response that references the problem via sc:isBasedOn and fills in the placeholders with real data, code, or results.

Any component — input, output, implementation, or evaluation — can be a concrete value in the problem or a placeholder for the solution. This unlocks several use cases:

Use case	Problem givens	Problem specs	Solution provides
Model benchmarks	Input dataset	Output schema, model, metrics	Model, output, evaluation results
Performance benchmarks	Code or model, input, output	Metrics	Execution environment, results
Coding competitions	Input data	Expected output, metrics	Implementation, results

Crucially, this formal structure lets automated tools verify that a solution "matches" its problem by fulfilling every specified requirement.
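A rough idea of what such a check could look like, as a Python sketch. The way Specs are attached to the problem here (a node whose `@type` ends in `Spec`, sitting under the same property the solution must fill) is an assumption made for illustration, as are the `ex:` identifiers and the `unmet_requirements` helper; the actual vocabulary and its SHACL shapes may model this differently.

```python
COMPONENTS = ("cr:input", "cr:output", "cr:implementation", "cr:evaluation")


def is_spec(node) -> bool:
    """Assumed convention: placeholders are nodes typed cr:...Spec."""
    return isinstance(node, dict) and str(node.get("@type", "")).endswith("Spec")


def unmet_requirements(problem: dict, solution: dict) -> list[str]:
    """List every way the solution fails to answer the problem."""
    issues = []
    based_on = solution.get("sc:isBasedOn")
    ref = based_on.get("@id") if isinstance(based_on, dict) else based_on
    if ref != problem.get("@id"):
        issues.append("sc:isBasedOn does not reference the problem")
    for prop in COMPONENTS:
        if is_spec(problem.get(prop)):  # the problem left this for the solution
            filled = solution.get(prop)
            if filled is None or is_spec(filled):
                issues.append(f"{prop} is still unfilled")
    return issues


# Hypothetical model-benchmark problem: input is given, output and metrics are specs.
problem = {
    "@type": "cr:TaskProblem",
    "@id": "ex:mmlu_problem",
    "cr:input": {"@type": "sc:Dataset", "@id": "https://huggingface.co/datasets/cais/mmlu"},
    "cr:output": {"@type": "cr:OutputSpec"},
    "cr:evaluation": {"@type": "cr:EvaluationSpec"},
}
solution = {
    "@type": "cr:TaskSolution",
    "@id": "ex:mmlu_solution",
    "sc:isBasedOn": {"@id": "ex:mmlu_problem"},
    "cr:output": {"@type": "sc:Dataset", "@id": "ex:predictions"},
}
print(unmet_requirements(problem, solution))
```

Here the solution supplies an output but no evaluation, so the checker reports exactly one unmet requirement; adding a concrete `cr:evaluation` node clears it.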

Does it actually work?

The paper evaluates two questions on a sample of five benchmark papers accepted as orals or spotlights at the NeurIPS 2025 Datasets and Benchmarks track — AbsenceBench, CoRe, MedSG-Bench, NOVA, and SAGE-Eval — spanning text, code, medical imaging, brain MRI, and safety reasoning.

Expressivity: can an agent extract a Croissant Task from a paper? An autonomous agent generated a task description from each paper; the descriptions were then validated against SHACL shapes and reviewed by human experts.

Metric	Absence	CoRe	MedSG	NOVA	SAGE	Average
Field coverage (%)	100	87	100	100	100	97.4

The single dip (CoRe, 87%) was an extraction miss of three hyperparameters — a failure of the extraction process, not the format.

Agentic reproduction: is the spec enough to rebuild the evaluation? A coding agent generated implementations from scratch under two settings: given only the paper PDF, or given only the Croissant Task file. Web access was monitored to ensure the agent never retrieved the paper's own code.

Setting	Absence	CoRe	MedSG	NOVA	SAGE	Average
Croissant Tasks only (%)	100	100	100	85.7	100	97.1
PDF only (%)	100	100	100	50	100	90

The compact, structured specification outperformed the full PDF (97.1% versus 90% on average). The PDF occasionally caused the agent to terminate early from context overload, while the Croissant Task provided a clear blueprint with far smaller context consumption. The generated files and code are available in the MLCommons Croissant repository.
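For readers who want to check the Average columns, the short Python snippet below recomputes them from the per-benchmark scores in the two tables. It assumes the averages are unweighted means over the five benchmarks, which is consistent with the reported figures but not stated explicitly.

```python
# Per-benchmark scores in table order: Absence, CoRe, MedSG, NOVA, SAGE.
scores = {
    "Field coverage": [100, 87, 100, 100, 100],
    "Croissant Tasks only": [100, 100, 100, 85.7, 100],
    "PDF only": [100, 100, 100, 50, 100],
}

# Unweighted mean per row, rounded to one decimal as in the tables.
averages = {name: round(sum(vals) / len(vals), 1) for name, vals in scores.items()}

# Gap between the two reproduction settings, in percentage points.
gap = round(averages["Croissant Tasks only"] - averages["PDF only"], 1)

for name, avg in averages.items():
    print(f"{name}: {avg}")
print(f"Gap (Croissant Tasks only vs. PDF only): {gap} points")
```

The whole gap between the two settings comes from a single benchmark, NOVA (85.7 versus 50); the other four were reproduced fully in both settings.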

Beyond reproducibility

The same structure pays off in three other dimensions. Interoperability: a benchmark defined once can be hot-swapped across harnesses like HELM or lm-evaluation-harness and platforms like Codabench, without bespoke adapter code. Evolution and reuse: because the format is declarative rather than procedural, an agent can regenerate an implementation when a library goes obsolete, and tasks can be versioned as they grow. Discoverability: extending schema.org means tasks inherit web discoverability, enabling custom search engines and indexes that both humans and agents can query.

Challenges remain: adoption needs editors, validators, and institutional incentives (e.g., conference checklists), and metadata fidelity is essential to set the tolerances that conceptual reproducibility relies on. But the direction is clear: shifting from brittle technical replication toward verifiable, high-level specifications lets the community move faster on a foundation of trust.

Paper and authors

Croissant Tasks: A Metadata Format for Reproducible Machine Learning Evaluations. Available as an arXiv preprint (arXiv:2605.29786).

Omar Benjelloun, Leonardo Martins Bianco, Isabelle Guyon, Thanh Gia Hieu Khuong, Jonathan Lebensold, Sebastian Lobentanzer, Luis Oala, Benedictus Kent Rachmat, Ihsan Ullah, Peyman Vahidi, and Joaquin Vanschoren.

Affiliations: Google DeepMind, ChaLearn, Université Paris-Saclay, Jetty, Mila (Quebec AI Institute), Helmholtz Munich, the German Center for Diabetes Research, the Technical University of Munich, Helmholtz AI, Brickroad, and Eindhoven University of Technology.