Link - https://arxiv.org/html/2306.05685v4
This is one of those academic papers every product builder should read. The concept of using LLMs as judges now powers a wide range of products.
Brief Summary
There are three variations of LLM-as-a-judge:
Pairwise comparison. An LLM judge is presented with a question and two answers, and tasked to determine which one is better or declare a tie.
Single answer grading. Alternatively, an LLM judge is asked to directly assign a score to a single answer.
Reference-guided grading. In certain cases, it may be beneficial to provide a reference solution. This is applicable to both pairwise comparison and single-answer grading.
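A minimal sketch of how the three modes translate into prompts. The `call_judge` stub, prompt wording, and scoring scale are illustrative assumptions, not the paper's exact templates:

```python
# Sketch of the three judging modes. `call_judge` is a hypothetical stand-in
# for whatever LLM API you use; the prompt wording is illustrative only.

def call_judge(prompt: str) -> str:
    """Send `prompt` to your judge LLM and return its raw text reply."""
    raise NotImplementedError  # plug in your LLM client here

def pairwise_comparison(question: str, answer_a: str, answer_b: str) -> str:
    prompt = (
        "You are an impartial judge. Compare the two assistant answers to the "
        "user question and reply with 'A', 'B', or 'tie'.\n\n"
        f"[Question]\n{question}\n\n[Answer A]\n{answer_a}\n\n[Answer B]\n{answer_b}"
    )
    return call_judge(prompt)

def single_answer_grading(question: str, answer: str) -> str:
    prompt = (
        "Rate the assistant's answer to the user question on a 1-10 scale for "
        "correctness, relevance, and clarity. Reply with the number only.\n\n"
        f"[Question]\n{question}\n\n[Answer]\n{answer}"
    )
    return call_judge(prompt)

def reference_guided_grading(question: str, reference: str, answer: str) -> str:
    prompt = (
        "Using the reference solution as ground truth, rate the assistant's "
        "answer on a 1-10 scale. Reply with the number only.\n\n"
        f"[Question]\n{question}\n\n[Reference]\n{reference}\n\n[Answer]\n{answer}"
    )
    return call_judge(prompt)
```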
Trade-offs between the strategies
Method: Pairwise Comparison
Pros
More stable and reliable than absolute scores (less calibration noise)
Less sensitive to verbosity and score scale drift
Aligns closely with human preference setups
Cons
Not scalable: requires O(n²) comparisons to cover many systems (e.g., ranking 20 systems takes 190 pairwise comparisons per question)
Expensive for >10–20 systems without careful sampling
Still susceptible to position bias (must randomize sides)
Harder to capture absolute correctness (good only for relative ranking)
Method: Single-Answer Grading
Pros
Easiest to implement: single prompt, single answer
Most scalable to 100s–1000s of instances in minutes
Can be automated cheaply across many datasets
Flexible: can be used for open-ended, multi-turn tasks
Cons
Prone to biases (verbosity, self-preference, style)
Prompt injection / adversarial phrasing can manipulate judge
No grounding if rubric is vague → unstable scores
Poor calibration: different prompts/models give inconsistent scales
Method: Reference-Guided Grading
Pros
Improved stability
Better at detecting faithfulness and correctness (esp. RAG)
Cons
Can be over-strict with creative answers that differ from the reference
Works best for closed or semi-open tasks (summaries, factual QA)
Reference quality limits ceiling of evaluation
Increases cost (prompt length)
Biases and their resolution
Method: Pairwise comparison
Position bias is when an LLM exhibits a propensity to favor certain positions over others.
The paper illustrates this with an example: GPT-4 is asked to evaluate two responses, from GPT-3.5 and Vicuna-13B, to an open-ended question. When GPT-3.5’s answer is positioned first, GPT-4 considers it more detailed and superior. However, upon switching the positions of the two responses, GPT-4’s judgment flips, favoring Vicuna’s answer.
Solution
Position bias can be addressed with simple remedies. A conservative approach is to call the judge twice, swapping the order of the two answers, and to declare a win only when an answer is preferred in both orders; if the results are inconsistent after swapping, call it a tie. A more aggressive approach is to assign positions randomly, which is effective at large scale because the bias averages out in expectation.
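A minimal sketch of the conservative swap-and-check approach, assuming a `compare(question, first_answer, second_answer)` callable (such as the pairwise prompt above) that returns 'A', 'B', or 'tie':

```python
from typing import Callable

def judge_both_orders(
    compare: Callable[[str, str, str], str],
    question: str,
    answer_1: str,
    answer_2: str,
) -> str:
    """Declare a win only if the same answer is preferred in both orders."""
    first = compare(question, answer_1, answer_2)   # answer_1 shown in position A
    second = compare(question, answer_2, answer_1)  # answer_1 shown in position B

    if first == "A" and second == "B":
        return "answer_1"  # preferred regardless of position
    if first == "B" and second == "A":
        return "answer_2"  # preferred regardless of position
    return "tie"           # inconsistent or explicit tie
```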
Method: Single-answer grading
Self-enhancement bias
LLM judges may favor the answers generated by themselves.
For example, GPT-4 favors itself with a 10% higher win rate and Claude-v1 favors itself with a 25% higher win rate. However, the effect is not universal: both also favor other models at times, and GPT-3.5 does not favor itself.
Solution
Don’t let one model be both contestant and judge.
Use a different model family (e.g., Claude to judge GPT, or Gemini to judge Claude).
Even better: use an ensemble of heterogeneous judges and aggregate their verdicts (majority vote / weighted Elo).
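A minimal sketch of such an ensemble with majority voting. The judge interface and the strict-majority rule are assumptions for illustration:

```python
# `judges` maps a judge name to a callable returning 'answer_1', 'answer_2',
# or 'tie' (e.g. the swap-order helper above bound to different judge models).
from collections import Counter
from typing import Callable, Dict

def ensemble_verdict(
    judges: Dict[str, Callable[[str, str, str], str]],
    question: str,
    answer_1: str,
    answer_2: str,
) -> str:
    votes = Counter(judge(question, answer_1, answer_2) for judge in judges.values())
    verdict, count = votes.most_common(1)[0]
    # Require a strict majority of judges to agree; otherwise declare a tie.
    return verdict if count > len(judges) / 2 else "tie"
```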
Verbosity bias is when an LLM judge favors longer, verbose responses, even if they are not as clear, high-quality, or accurate as shorter alternatives.
Solution
Explicitly tell the judge not to reward length
“Do not consider verbosity, only correctness, relevance, and conciseness.”
Enforce word/character limits on model outputs before scoring.
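A minimal sketch combining both mitigations; the 300-word cap, prompt wording, and `judge` callable are illustrative assumptions, not the paper's setup:

```python
from typing import Callable

MAX_WORDS = 300  # assumed cap; tune per task

def truncate(answer: str, max_words: int = MAX_WORDS) -> str:
    """Drop everything past the first `max_words` words before scoring."""
    return " ".join(answer.split()[:max_words])

def grade_length_controlled(judge: Callable[[str], str], question: str, answer: str) -> str:
    prompt = (
        "Rate the assistant's answer on a 1-10 scale. Do not consider "
        "verbosity; judge only correctness, relevance, and conciseness. "
        "Reply with the number only.\n\n"
        f"[Question]\n{question}\n\n[Answer]\n{truncate(answer)}"
    )
    return judge(prompt)
```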