MLLM-as-a-Judge: Multimodal evaluators for image-to-text CI with Strands Evals
TL;DR
- Multimodal evaluators (MLLM-as-a-Judge) let your CI check image-to-text outputs by ingesting the actual image plus the text output—closing the gap text-only critics leave open.
- Strands Evals adds four multimodal judges: Overall Quality (Likert 1–5), Correctness (binary), Faithfulness (binary), and Instruction Following (binary). Judges run on Amazon Bedrock (Anthropic Claude Sonnet 4.6 is the recommended default).
- Best practices: let the judge “reason before scoring,” use few-shot calibration, and prefer multi-dimensional rubrics. Log score + reason to make CI failures actionable.
What these terms mean, fast
- MLLM: a multimodal large language model that can read both images and text.
- Multimodal evaluator: an automated judge that consumes image + text to score outputs (e.g., captions, table extracts, chart reads).
- Reference-based vs reference-free: reference-based compares an output to a ground-truth label; reference-free validates format, structure, or whether content is grounded without a canonical answer.
- Likert 1–5: a graded quality scale where 1 = poor and 5 = excellent.
- Reason-before-scoring: a prompt pattern that asks the model to write out its chain of thought or verification steps before emitting the final score, which improves alignment with human judgments.
The problem: pixels are the ground truth
If you need to verify that a caption, a table extraction, or a chart read is actually grounded in an image, a text-only critic won’t do. The ground truth lives in the pixels. Text-only evaluation pipelines that rely on auto-generated image descriptions miss visual hallucinations, invented fields, and formatting errors that are visible only when a judge can literally look at the image.
The practical fix: MLLM-as-a-Judge
Strands Evals now includes four multimodal evaluators that consume images plus text and return a numeric score plus a human-readable reasoning string. They run on Amazon Bedrock (defaulting to Anthropic Claude Sonnet 4.6 in Strands experiments) and support both reference-based and reference-free modes. That combination gives teams a diagnostic, CI-friendly way to catch visual hallucinations, correctness issues, and instruction violations automatically.
How it works (high level)
Strands Evals follows a Case → Experiment → Report workflow. For image-to-text evaluation you:
- Create a Case with the source image, expected output (optional), and the system output.
- Run an Experiment that invokes a multimodal evaluator on Bedrock; input includes MultimodalInput/ImageData and the candidate text.
- Collect a Report where each run returns a score and a reasoning string for debugging and trend analysis.
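The three steps above can be sketched in plain Python. This is a minimal illustration of the Case → Experiment → Report shape only: `ImageCase`, `JudgeResult`, `run_experiment`, and `stub_judge` are hypothetical names invented for this sketch and are not the Strands Evals API, and the stub judge does a simple string comparison where a real evaluator would send the image and text to a multimodal model. The second case uses a made-up value purely to show a failing row.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


# Hypothetical stand-ins for illustration; NOT the Strands Evals classes.
@dataclass
class ImageCase:
    name: str
    image_bytes: bytes
    actual_output: str
    expected_output: Optional[str] = None  # leave as None for reference-free checks


@dataclass
class JudgeResult:
    score: float
    reason: str


def run_experiment(
    cases: List[ImageCase], judge: Callable[[ImageCase], JudgeResult]
) -> List[dict]:
    """Run the judge on every case and collect one report row per case."""
    report = []
    for case in cases:
        result = judge(case)
        report.append(
            {"case": case.name, "score": result.score, "reason": result.reason}
        )
    return report


def stub_judge(case: ImageCase) -> JudgeResult:
    # Placeholder logic: a real judge would inspect case.image_bytes with an MLLM.
    if case.expected_output is not None and case.actual_output == case.expected_output:
        return JudgeResult(1.0, "Output matches the reference.")
    return JudgeResult(0.0, "Output differs from the reference, or no reference was given.")


cases = [
    ImageCase("chart-1", b"<png bytes>", "$13.32", "$13.32"),
    ImageCase("chart-2", b"<png bytes>", "$9.99", "$8.75"),  # made-up failing case
]
report = run_experiment(cases, stub_judge)
for row in report:
    print(row)
```

Keeping the judge as a plain callable makes it easy to swap the stub for a real multimodal evaluator without touching the loop that builds the report.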
Why multimodal judges beat text-only proxies
- Direct grounding: judges that see the image can verify whether a number, object, or label actually appears.
- Better alignment: experiments show multimodal judges align substantially better with human scores than judges fed auto-generated captions or OCR text alone.
- Actionable logs: every evaluator returns a reasoning string alongside the score, making CI failures diagnosable without re-running expensive tests.
“A text-only evaluator can’t verify whether a caption or extracted field is actually grounded in the source image — the ground truth lives in the image.”
Design lessons that matter
- Reason-before-scoring helps most: asking the judge to show its verification steps before giving a score materially improves judge-to-human alignment.
- Few-shot calibration works: providing a handful of labeled examples tunes the judge’s decision boundary toward your rubric and domain.
- Multi-dimensional rubrics are more useful: a single holistic score hides failure modes—separate metrics for correctness, faithfulness, and instruction following reveal where to fix the model or the pipeline.
- Reference-based when you can: use references for content-grounded checks (correctness, faithfulness). Use reference-free for structural or format checks (instruction following).
Short vignette: chart-reading that needed verification
On a chart-reading task (ChartQA/Statista), a system reported “U.S. and Canada — $13.32.” The multimodal Overall Quality evaluator and the three binary evaluators (Correctness, Faithfulness, Instruction Following) each returned a perfect score with a short reasoning trace: the judge located the series label, matched the bar value visually, and confirmed that the text format matched the instruction. The reason+score pair was logged to CI, which let the team triage an intermittent extraction bug without a manual rerun.
Sample judge reasoning (sanitized)
Found label “U.S. and Canada” on the legend; identified bar corresponding to that label; read the bar value as $13.32 from the y-axis tick alignment; output format matches required “Region — $Value”. Score: 5/5.
Quick start for teams (three steps)
- Wire Overall Quality (Likert 1–5) into a sampled CI pipeline for broad coverage and sanity checks.
- When failures cluster, add the binary evaluators—Correctness, Faithfulness, Instruction Following—to diagnose whether the issue is factual, visual hallucination, or a format violation.
- Log both score and the reasoning string; use the reasoning to triage fixes without rerunning costly tests.
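One way to picture the sampling and logging steps is the sketch below. It assumes a reproducible random sample and one JSON object per log line; the function names, the field names, and the 30% sample rate are illustrative choices, not something Strands Evals prescribes.

```python
import json
import random
from typing import List


def sample_cases(case_ids: List[str], rate: float, seed: int = 0) -> List[str]:
    """Pick a reproducible random subset of case IDs for one CI run."""
    rng = random.Random(seed)  # fixed seed so reruns judge the same cases
    k = min(len(case_ids), max(1, round(len(case_ids) * rate)))
    return sorted(rng.sample(case_ids, k))


def to_log_line(case_id: str, evaluator: str, score: float, reason: str) -> str:
    """Serialize the score together with the judge's reasoning as one JSON line."""
    record = {"case": case_id, "evaluator": evaluator, "score": score, "reason": reason}
    return json.dumps(record, sort_keys=True)


all_ids = [f"case-{i}" for i in range(10)]
picked = sample_cases(all_ids, rate=0.3)
line = to_log_line(picked[0], "overall_quality", 5, "Bar value matches the output.")
print(line)
```

Because the reason travels with the score in the same record, a failing CI run can be triaged from the log alone.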
Prompt template & few-shot tip
Ask the judge to explain its verification steps, then give a final score. Few-shot examples should mirror your expected failure modes and score boundaries.
Prompt template (conceptual):
“Given the image and the model output, verify whether the output is grounded in the image. First, list the evidence (what you see in the image). Then, check the output against that evidence and the optional reference. Finally, produce a score and a one-sentence justification.”
Include 3–5 calibration examples where the correct score and reasoning are shown. That guides the judge’s interpretation of edge cases (partial credit, formatting quirks, ambiguous visuals).
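As a rough sketch of how such a prompt might be assembled and how the final score might be read back, consider the helper below. It covers only the text portion: the image itself, and any images belonging to the calibration examples, would be attached through whatever model API you use. The names `build_prompt` and `parse_score`, and the "Score: N" convention, are assumptions made for this example.

```python
import re
from typing import List, Optional

INSTRUCTIONS = (
    "Given the image and the model output, verify whether the output is grounded "
    "in the image. First, list the evidence (what you see in the image). Then, "
    "check the output against that evidence and the optional reference. Finally, "
    "write 'Score: N' with N from 1 to 5 and a one-sentence justification."
)


def build_prompt(
    model_output: str, examples: List[dict], reference: Optional[str] = None
) -> str:
    """Assemble instructions, few-shot calibration examples, then the item to judge."""
    parts = [INSTRUCTIONS, ""]
    for i, ex in enumerate(examples, start=1):
        parts += [
            f"Example {i}",
            f"Output: {ex['output']}",
            f"Reasoning: {ex['reasoning']}",
            f"Score: {ex['score']}",
            "",
        ]
    parts += ["Now evaluate", f"Output: {model_output}"]
    if reference is not None:
        parts.append(f"Reference: {reference}")
    parts.append("Reasoning:")  # reasoning is requested before the score
    return "\n".join(parts)


SCORE_RE = re.compile(r"Score:\s*([1-5])\b")


def parse_score(reply: str) -> Optional[int]:
    """Return the last 'Score: N' in the reply, or None if the judge gave none."""
    matches = SCORE_RE.findall(reply)
    return int(matches[-1]) if matches else None


examples = [
    {"output": "$13.32", "reasoning": "Bar label and value both match.", "score": 5},
    {"output": "$31.23", "reasoning": "Digits do not match the bar value.", "score": 1},
]
prompt = build_prompt("$13.32", examples)
print(prompt)
```

Taking the last match rather than the first means a judge that revises its verdict while reasoning is scored on its final answer.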
Example rubric: chart-reading
- Correctness (binary): Does the extracted value match a visually verifiable number in the chart?
- Faithfulness (binary): Is any content in the output invented or unsupported by the image?
- Instruction Following (binary): Does the output follow the required format, labels, and units?
- Overall Quality (1–5): 1 = wrong and misleading, 3 = partially correct or ambiguous, 5 = accurate, clear, and properly formatted.
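A rubric like this maps naturally onto a small record plus a gate function. The sketch below is illustrative only: `RubricScores`, `failed_dimensions`, and the minimum overall score of 4 are assumptions for this example, so pick a threshold that fits your own tolerance.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RubricScores:
    correctness: bool            # value matches a visually verifiable number
    faithfulness: bool           # True means nothing in the output is invented
    instruction_following: bool  # format, labels, and units are as required
    overall: int                 # Likert 1-5

    def __post_init__(self) -> None:
        if not 1 <= self.overall <= 5:
            raise ValueError("overall must be between 1 and 5")


def failed_dimensions(scores: RubricScores, min_overall: int = 4) -> List[str]:
    """Name every rubric dimension that should fail the CI check."""
    failed = []
    if not scores.correctness:
        failed.append("correctness")
    if not scores.faithfulness:
        failed.append("faithfulness")
    if not scores.instruction_following:
        failed.append("instruction_following")
    if scores.overall < min_overall:
        failed.append("overall_quality")
    return failed


good = RubricScores(True, True, True, overall=5)
bad_format = RubricScores(True, True, False, overall=3)
print(failed_dimensions(good))
print(failed_dimensions(bad_format))
```

Reporting the failed dimensions by name, rather than a single pass/fail flag, preserves the diagnostic value of the separate metrics.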
Operational considerations & mitigations
- Cost & latency: Each multimodal judgment is an extra model call. In Strands experiments on Bedrock, multimodal judging remained competitive with multi-step text-only pipelines once that call was accounted for. Mitigation: sample inputs, batch evaluations, use lower-tier judges for bulk checks, and reserve the high-quality judge for edge cases.
- Privacy & compliance: Sending images to hosted MLLMs raises data residency and policy concerns. Options: use Bedrock VPC endpoints where available, redact or hash sensitive regions, or run evaluations on private or on-prem multimodal models when needed.
- Adversarial and OOD inputs: Judges can be fooled—models may learn to game a specific judge. Countermeasures: diversify your judge models, rotate calibration examples, and include adversarial test cases in your CI sample set.
- Scaling: Large-scale continuous evaluation requires sampling strategies (stratified by model confidence, by user impact, or by domain), caching repeated images, and tiered judging with cheaper proxies for low-risk checks.
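Caching and tiering can be combined in a few lines. The sketch below makes several assumptions: a judge is any callable that takes image bytes and output text and returns a `(score, reason)` pair, verdicts are cached per image and output pair using SHA-256 hashes, and the cheap judge escalates to the strong one only when its score falls strictly between `low` and `high`. The escalation rule and all names here are illustrative, not a Strands Evals feature.

```python
import hashlib
from typing import Callable, Dict, Tuple

Verdict = Tuple[int, str]
JudgeFn = Callable[[bytes, str], Verdict]


class CachedJudge:
    """Wrap a judge so repeated (image, output) pairs are only judged once."""

    def __init__(self, judge: JudgeFn) -> None:
        self._judge = judge
        self._cache: Dict[str, Verdict] = {}
        self.calls = 0  # number of times the wrapped judge actually ran

    def __call__(self, image_bytes: bytes, output_text: str) -> Verdict:
        key = (
            hashlib.sha256(image_bytes).hexdigest()
            + ":"
            + hashlib.sha256(output_text.encode("utf-8")).hexdigest()
        )
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._judge(image_bytes, output_text)
        return self._cache[key]


def tiered_judge(cheap: JudgeFn, strong: JudgeFn, low: int = 2, high: int = 4) -> JudgeFn:
    """Use the cheap judge first; escalate only when its score is ambiguous."""

    def judge(image_bytes: bytes, output_text: str) -> Verdict:
        score, reason = cheap(image_bytes, output_text)
        if low < score < high:  # middling score: ask the stronger judge
            return strong(image_bytes, output_text)
        return score, reason

    return judge


# Stub judges for demonstration; real ones would call multimodal models.
def fake_cheap(image_bytes: bytes, output_text: str) -> Verdict:
    return (3, "unsure") if output_text == "ambiguous" else (5, "clear match")


def fake_strong(image_bytes: bytes, output_text: str) -> Verdict:
    return (4, "verified against the chart")


judge = CachedJudge(tiered_judge(fake_cheap, fake_strong))
print(judge(b"img", "clear"))
print(judge(b"img", "clear"))      # served from the cache
print(judge(b"img", "ambiguous"))  # escalated to the strong judge
```

Caching the verdict rather than only the image avoids paying for a second judgment when the same output is re-evaluated, at the cost of hiding any run-to-run variance in the judge.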
When to use reference-free checks
Use reference-free for structural or format validations where a canonical answer doesn’t exist (e.g., “Did the output include the three required fields?”). Use reference-based when you have a ground-truth label to verify factual correctness or faithfulness to the image.
Questions leaders ask
- Do judges need to see the image to be effective?
Yes. Multimodal judges that receive the image align materially better with human scores than text-only proxies that rely on generated descriptions.
- Which evaluator should I start with?
Start with Overall Quality (Likert 1–5) for broad sanity checks, then add Correctness, Faithfulness, and Instruction Following for targeted diagnostics.
- Which model to run on Bedrock?
Anthropic Claude Sonnet 4.6 was the recommended default in Strands experiments for a strong accuracy-to-cost balance. Evaluate alternatives to match your budget and latency needs.
- How do I make CI failures actionable?
Log both the numeric score and the human-readable reasoning string. Use the reason trace to triage fixes without rerunning the full test suite.
Next-step checklist
- Wire Overall Quality into a sampled CI pipeline and log reason+score.
- Add binary evaluators where failures cluster and build a small calibration set for few-shot tuning.
- Design a sampling and tiering strategy to control cost/latency at scale.
- Audit privacy implications and decide whether to use cloud-hosted judges or private/on-prem alternatives.
- Include adversarial and OOD examples in your CI to detect gaming or brittle behavior.
Multimodal evaluators don’t eliminate every risk, but they close a crucial gap: your test harness can now look at the same ground truth humans use—the image—and verify whether language outputs are actually anchored to it. For teams shipping vision+language features, that change turns brittle “sounds-right” systems into verifiably correct, debuggable production components.