Hugging Face Evaluate: The One-Liner Everyone Gets Wrong
Hugging Face's evaluate library promises a BLEU score in one line of Python — and it delivers. The catch is that it is easy to compute a BLEU number that is quietly wrong, because the references argument has a shape that isn't the obvious one. Get the shape right and evaluate is the fastest path to a trustworthy metric. Get it wrong and you ship a number that means nothing. Here is the correct usage, task by task, plus an honest take on where the library belongs in 2026.
TL;DR
- evaluate loads dozens of evaluation modules with one line and scores them with compute().
- Three kinds of module: metrics (predictions vs references), comparisons (two models), measurements (dataset properties).
- The trap: for BLEU/sacreBLEU, references is a list of lists of strings, not a flat list, with one or more references per prediction.
- combine(), the evaluator, and EvaluationSuite cover everything past a single metric.
- For LLM evaluation, Hugging Face now points to LightEval; evaluate is the classic-metrics layer, not deprecated.
The trap, first
Start with the mistake, because it is the reason this library so often gets looked up twice.
import evaluate
bleu = evaluate.load("bleu")
# WRONG: references as a flat list of strings
bleu.compute(predictions=["hello there general kenobi"],
             references=["hello there general kenobi"])  # error or wrong score
# RIGHT: references as a list of *lists* of strings
bleu.compute(predictions=["hello there general kenobi"],
             references=[["hello there general kenobi", "hello there!"]])
The reason is linguistic, not arbitrary: a translation can have several equally valid references, so BLEU is defined against a set of references per prediction. The predictions argument is a flat list of candidate strings; references is a list where each entry is itself a list of acceptable strings. Once that clicks, the rest of the library is genuinely one line at a time.
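One way to make the shape explicit before anything reaches compute() is to normalize and check it yourself. The sketch below is plain Python and does not call the library; as_reference_lists and check_shapes are hypothetical helper names introduced here for illustration, not part of evaluate.

```python
from typing import List, Sequence, Union

# Hypothetical helpers for illustration; they are not part of the evaluate library.
RefsInput = Sequence[Union[str, Sequence[str]]]


def as_reference_lists(references: RefsInput) -> List[List[str]]:
    """Return references as a list of lists of strings."""
    normalized: List[List[str]] = []
    for entry in references:
        if isinstance(entry, str):
            # A bare string is read as "exactly one reference for this prediction".
            normalized.append([entry])
        else:
            normalized.append(list(entry))
    return normalized


def check_shapes(predictions: Sequence[str], references: List[List[str]]) -> None:
    """Raise ValueError if predictions and reference groups cannot line up."""
    if len(predictions) != len(references):
        raise ValueError(
            f"{len(predictions)} predictions but {len(references)} reference groups"
        )
    for index, group in enumerate(references):
        if not group:
            raise ValueError(f"prediction {index} has no references")


predictions = ["hello there general kenobi"]
flat_refs = ["hello there general kenobi"]

references = as_reference_lists(flat_refs)
check_shapes(predictions, references)
print(references)  # [['hello there general kenobi']]
```

With the shape fixed up front, the list you hand to the metric always has one group of references per prediction, whatever form the raw data arrived in.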
What the evaluate library is
evaluate is a library for evaluating ML models and datasets, exposing "dozens of popular metrics" through one uniform API. It operates on plain Python lists, so it is framework-agnostic — it works with PyTorch, TensorFlow and scikit-learn outputs alike. Its modules come in three kinds:
- Metrics — predictions vs ground truth (accuracy, F1, BLEU, ROUGE, METEOR, exact_match).
- Comparisons — two models against each other.
- Measurements — properties of a dataset (text complexity, label distribution).
Most day-to-day use is metrics, and the pattern is always the same: evaluate.load("name"), then .compute(...), which returns a dictionary.
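To make "returns a dictionary" concrete, here is a plain-Python stand-in for what an accuracy metric and a macro-averaged F1 metric compute. This is a sketch for checking numbers by hand, not the library's implementation; accuracy_score and macro_f1_score are hypothetical names chosen for this example.

```python
from typing import Dict, Sequence


def accuracy_score(predictions: Sequence[int], references: Sequence[int]) -> Dict[str, float]:
    """Fraction of positions where prediction and reference agree."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equal in length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}


def macro_f1_score(predictions: Sequence[int], references: Sequence[int]) -> Dict[str, float]:
    """Unweighted mean of per-label F1 scores."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equal in length")
    labels = sorted(set(predictions) | set(references))
    per_label = []
    for label in labels:
        pairs = list(zip(predictions, references))
        tp = sum(p == label and r == label for p, r in pairs)
        fp = sum(p == label and r != label for p, r in pairs)
        fn = sum(p != label and r == label for p, r in pairs)
        denominator = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the label never occurs.
        per_label.append(2 * tp / denominator if denominator else 0.0)
    return {"f1": sum(per_label) / len(per_label)}


print(accuracy_score([0, 1, 1, 0], [0, 1, 0, 0]))  # {'accuracy': 0.75}
print(macro_f1_score([0, 1, 1], [0, 1, 0]))        # both labels score 2/3, so the mean is 2/3
```

The point of the dictionary return is that every module, whatever it measures, hands back named scores you can log or merge without special cases.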
Recipe 1 — Score a classifier (accuracy, F1)
Classification metrics take flat lists of labels:
accuracy = evaluate.load("accuracy")
accuracy.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0])
# {'accuracy': 0.75}
f1 = evaluate.load("f1")
f1.compute(predictions=[0, 1, 1], references=[0, 1, 0], average="macro")
Recipe 2 — Score a translation (BLEU / sacreBLEU)
The one to memorize — predictions is a list of strings; references is a list of lists of strings:
bleu = evaluate.load("bleu")
predictions = ["hello there general kenobi", "foo bar foobar"]
references = [["hello there general kenobi", "hello there!"], ["foo bar foobar"]]
bleu.compute(predictions=predictions, references=references)
# sacreBLEU uses the same list-of-lists shape, but — unlike bleu — it requires the
# SAME number of references for every prediction, so pad to equal length:
sacrebleu = evaluate.load("sacrebleu")
sacrebleu.compute(
    predictions=predictions,
    references=[["hello there general kenobi", "hello there!"],
                ["foo bar foobar", "foo bar foobar"]],
)
# {'score': ..., 'counts': [...], 'totals': [...], 'precisions': [...],
#  'bp': ..., 'sys_len': ..., 'ref_len': ...}
Pass a flat list of reference strings and, depending on the metric and the installed version, you can get an error or a silently wrong score. This is the most common mistake with this library. Note the asymmetry above as well: bleu accepts a different number of references per prediction; sacrebleu does not.
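The equal-length requirement is easy to satisfy mechanically. The following is a minimal sketch, assuming that repeating a group's last reference is acceptable padding (it mirrors the duplicated "foo bar foobar" above); pad_references is a hypothetical helper, not a function the library provides.

```python
from typing import List, Sequence


def pad_references(references: Sequence[Sequence[str]]) -> List[List[str]]:
    """Repeat each group's last reference until every group has the same length.

    Hypothetical helper for illustration; not part of the evaluate library.
    """
    if not references:
        return []
    if any(len(group) == 0 for group in references):
        raise ValueError("every prediction needs at least one reference")
    width = max(len(group) for group in references)
    return [list(group) + [group[-1]] * (width - len(group)) for group in references]


references = [
    ["hello there general kenobi", "hello there!"],
    ["foo bar foobar"],
]
padded = pad_references(references)
print(padded)
# [['hello there general kenobi', 'hello there!'], ['foo bar foobar', 'foo bar foobar']]
```

The padded list can then be passed as references to a metric that insists on the same number of references for every prediction.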
Recipe 3 — Score a summary (ROUGE) and exact strings
rouge = evaluate.load("rouge")
rouge.compute(predictions=["the cat sat on the mat"],
              references=["the cat sat on the mat"])
# {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}
em = evaluate.load("exact_match")
em.compute(predictions=["Paris"], references=["Paris"])  # {'exact_match': 1.0}
ROUGE is recall-oriented and intended for summarization; predictions and references are lists of strings.
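For intuition about what these two scores reward, here is a deliberately simplified plain-Python sketch: exact match as strict string equality, and unigram recall as the simplest overlap in the ROUGE family. It assumes whitespace tokenization and no stemming, so the library's numbers can differ; exact_match_rate and unigram_recall are hypothetical names used only here.

```python
from collections import Counter
from typing import Dict, Sequence


def exact_match_rate(predictions: Sequence[str], references: Sequence[str]) -> Dict[str, float]:
    """Share of predictions that equal their reference character for character."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equal in length")
    matches = sum(p == r for p, r in zip(predictions, references))
    return {"exact_match": matches / len(references)}


def unigram_recall(prediction: str, reference: str) -> float:
    """Fraction of reference tokens that also appear in the prediction (clipped counts)."""
    pred_counts = Counter(prediction.split())
    ref_counts = Counter(reference.split())
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, pred_counts[token]) for token, count in ref_counts.items())
    return overlap / total


# Case matters for strict equality: only the first pair matches.
print(exact_match_rate(["Paris", "paris"], ["Paris", "Paris"]))  # {'exact_match': 0.5}

# A short summary covers 2 of the 6 reference tokens.
print(unigram_recall("the cat", "the cat sat on the mat"))
```

Recall-oriented means a prediction is rewarded for covering the reference, which is why a two-word summary of a six-word reference scores low here even though every word it contains is correct.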
Recipe 4 — Several metrics at once (combine)
preds, refs = [0, 1, 1, 0], [0, 1, 0, 0] # same flat label lists as Recipe 1
clf = evaluate.combine(["accuracy", "f1", "precision", "recall"])
clf.compute(predictions=preds, references=refs)  # one dict with all four
Recipe 5 — Evaluate a whole model (evaluator)
The evaluator runs a model or pipeline over a dataset and metric end-to-end, so you skip the manual loop:
task = evaluate.evaluator("text-classification")
task.compute(model_or_pipeline="distilbert-base-uncased-finetuned-sst-2-english",
             data="imdb", metric="accuracy",
             # map the pipeline's string labels onto the dataset's integer labels
             label_mapping={"NEGATIVE": 0, "POSITIVE": 1})
For many tasks at once, EvaluationSuite bundles (evaluator, dataset, metric) tuples so one model is scored across them, with the suite stored on the Hub.
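For context, this is roughly the manual loop an evaluator saves you from writing: run the model over every example, collect predictions and labels, then score them. The sketch is plain Python with a toy keyword classifier standing in for a real model; evaluate_model, keyword_classifier, and accuracy are all hypothetical names, and nothing here calls the library.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Metric = Callable[[Sequence[int], Sequence[int]], Dict[str, float]]


def evaluate_model(
    predict: Callable[[str], int],
    examples: Iterable[Tuple[str, int]],
    metric: Metric,
) -> Dict[str, float]:
    """Run predict over (text, label) pairs and score the collected results."""
    predictions: List[int] = []
    references: List[int] = []
    for text, label in examples:
        predictions.append(predict(text))
        references.append(label)
    return metric(predictions, references)


def keyword_classifier(text: str) -> int:
    """Toy stand-in for a real model: 1 if the word 'good' appears, else 0."""
    return 1 if "good" in text.lower() else 0


def accuracy(predictions: Sequence[int], references: Sequence[int]) -> Dict[str, float]:
    correct = sum(p == r for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}


examples = [
    ("A good film", 1),
    ("Dull and slow", 0),
    ("Not good at all", 0),  # the toy classifier gets this one wrong
    ("Really good", 1),
]
result = evaluate_model(keyword_classifier, examples, accuracy)
print(result)  # {'accuracy': 0.75}
```

The loop itself is trivial; what an end-to-end runner adds is the dataset loading, batching, and label handling around it.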
What it does NOT have
A frequent search is evaluate.load("information_retrieval") — there is no such built-in module. The core library is classic-metric territory; IR measures like recall@k, MRR and nDCG come from other libraries or community modules on the Hub, not from evaluate itself. Knowing the boundary saves an afternoon.
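If all you need is a quick number, the three measures named above are short enough to write by hand. The following is a minimal plain-Python sketch assuming binary relevance and a ranked list with no duplicate document IDs; the function names are hypothetical and this is not a replacement for a dedicated IR evaluation library.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Sequence[str]], relevants: Sequence[Set[str]]) -> float:
    """MRR: reciprocal rank averaged over queries."""
    if not runs or len(runs) != len(relevants):
        raise ValueError("need one relevant set per ranked list")
    return sum(reciprocal_rank(r, rel) for r, rel in zip(runs, relevants)) / len(runs)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """nDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

print(recall_at_k(ranked, relevant, k=3))  # 0.5 (only d1 is in the top 3)
print(reciprocal_rank(ranked, relevant))   # 0.5 (first relevant hit at rank 2)
print(ndcg_at_k(ranked, relevant, k=4))
```

Unlike the text metrics, these take a ranked list per query plus a set of relevant IDs, which is part of why they do not fit the predictions/references pattern as neatly.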
Where it fits in 2026
evaluate is excellent at standardized, reproducible metric computation, and it is not going away. But for LLM evaluation, Hugging Face's own focus has shifted to LightEval — a more actively maintained toolkit with 1000+ tasks, from Hugging Face's Leaderboard & Evals team; the evaluate docs themselves now point there for recent LLM work. Read it as a division of labor: evaluate is the classic-metrics layer; LightEval is the LLM-eval layer. Neither is deprecated.
The deeper point is what these metrics can't see. BLEU and ROUGE score string overlap; accuracy and F1 score labels. None of them can tell you whether a model reasoned correctly, used a tool properly, or remembered a fact from three sessions ago — the same blind spot the broader agent-evaluation tooling leaves open, and the reason evaluating an agent's memory needs a different instrument.
Common questions
What is the Hugging Face evaluate library? A Python library for one-line evaluation of ML models and datasets — evaluate.load() a module, compute() a score. Modules are metrics, comparisons, or measurements.
What is the predictions/references format for BLEU/sacreBLEU? predictions is a list of strings; references is a list of lists of strings (≥1 reference per prediction). sacreBLEU returns score, counts, totals, precisions, bp, sys_len, ref_len.
How do you use evaluate.load and compute()? m = evaluate.load("bleu"); m.compute(predictions=preds, references=refs) → a dict of scores.
Is there an information_retrieval metric? No — not a built-in module; use other tools or community modules for recall@k / MRR / nDCG.
Should I use evaluate for LLM evaluation? For classic metrics, yes; for LLMs, Hugging Face now points to LightEval. evaluate is not deprecated, just no longer the LLM-eval focus.
Related
- LangChain & LangSmith Evaluation: The Memory Blind Spot — the agent-eval tooling landscape this sits in
- LLM-as-a-Judge: Bias, Leniency & the LoCoMo Number — when a metric is another model's opinion
- How to Evaluate AI Agent Memory — what string-overlap metrics can't measure
