DeepEval: pytest-style LLM evals, with a catch
TL;DR
- DeepEval is an open-source LLM evaluation framework by Confident AI, framed as “Pytest for LLMs,” and that framing holds up because it turns evals into executable tests you can run in CI. GitHub, docs
- Its core strength is not a magic score. It is the workflow: define LLMTestCases, attach metrics, run deepeval test run, and fail the build when quality drops. getting started, CI docs
- Most of DeepEval’s metric surface is LLM-as-a-judge, including G-Eval, so scores inherit judge variance, leniency, and bias. The G-Eval paper itself reports a Spearman correlation of 0.514 with human judgment on summarization and flags bias toward LLM-generated text. DeepEval metrics docs, G-Eval paper, ACL Anthology
- The practical answer is to use explicit thresholds, inspect reasons, and prefer deterministic structures like DAG when a single judge score is too unstable to trust. metrics intro, DAG docs
DeepEval is an open-source LLM evaluation framework by Confident AI that packages LLM testing in a pytest-like workflow for prompts, RAG pipelines, agents, and conversational systems. GitHub, docs
As of June 2026, the project is Apache 2.0 licensed, requires Python 3.9 or newer, and has roughly 16k GitHub stars. Star counts drift, so read that number as dated context, not a fixed fact. GitHub
The cleanest way to understand DeepEval is this: it treats quality checks as test code, not as a notebook exercise and not as a dashboard ritual you remember to run later.
What DeepEval actually gives you
The core object is LLMTestCase.
LLMTestCase is DeepEval’s basic test record: it holds the prompt input, the model’s actual output, and optional expected output or retrieval context, while each metric reads only the fields it needs. getting started
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="You have 30 days to get a full refund at no extra cost.",
    expected_output="We offer a 30-day full refund at no extra cost.",
    retrieval_context=["All customers get a 30-day full refund at no extra cost."],
)
That object matters because it gives evals a common contract. A correctness metric may need expected_output. A faithfulness metric may need retrieval_context. The framework does not force one giant schema onto every evaluation problem. getting started
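To make the idea that each metric reads only the fields it needs a little more concrete, here is a minimal plain-Python sketch. It does not use DeepEval at all: the REQUIRED_FIELDS mapping and the missing_fields helper are hypothetical names, and the field lists are illustrative assumptions based on the two examples above, not DeepEval's actual validation rules.

```python
# Illustrative mapping only: which test-case fields each metric would read.
# These lists are assumptions for the sketch, not DeepEval's real requirements.
REQUIRED_FIELDS = {
    "correctness": ["input", "actual_output", "expected_output"],
    "faithfulness": ["input", "actual_output", "retrieval_context"],
}


def missing_fields(metric_name, test_case):
    """Return the fields the named metric needs that are absent or None."""
    return [
        field
        for field in REQUIRED_FIELDS[metric_name]
        if test_case.get(field) is None
    ]


# A test case with no retrieval context: enough for correctness,
# not enough for faithfulness.
case = {
    "input": "What if these shoes don't fit?",
    "actual_output": "You have 30 days to get a full refund at no extra cost.",
    "expected_output": "We offer a 30-day full refund at no extra cost.",
}

print(missing_fields("correctness", case))
print(missing_fields("faithfulness", case))
```

The point of the shared record is that one test case can feed several metrics, and a metric can report a missing field before any judge model is called.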
DeepEval’s metric contract is also simple. Every metric produces a score from 0 to 1 and a self-explaining reason, and carries a threshold that defaults to 0.5. A test passes when the score is at least the threshold. metrics intro
That sounds small, but it is the difference between “the answer looks worse” and “the build failed because faithfulness scored below threshold, and here is the reason.”
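The contract can be sketched in a few lines of plain Python. This is a stand-in under stated assumptions, not DeepEval code: MetricResult and failure_message are hypothetical names, and the score and reason values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    """Hypothetical stand-in for what a metric reports; not a DeepEval class."""
    name: str
    score: float            # expected to lie in [0, 1]
    reason: str             # the metric's own explanation of the score
    threshold: float = 0.5  # default threshold described above

    @property
    def passed(self) -> bool:
        # Pass rule described above: score at or above the threshold.
        return self.score >= self.threshold


def failure_message(result: MetricResult) -> str:
    """Format the kind of message a failed build could surface."""
    return (
        f"{result.name} scored {result.score:.2f}, "
        f"below threshold {result.threshold:.2f}: {result.reason}"
    )


# Made-up example values, for illustration only.
faithfulness = MetricResult(
    name="Faithfulness",
    score=0.40,
    reason="The answer promises free return shipping, which the context never mentions.",
)

if not faithfulness.passed:
    print(failure_message(faithfulness))
```

Note the boundary: a score exactly equal to the threshold passes, which matters when you tune thresholds close to typical scores.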
DeepEval G-Eval: flexible, useful, and not fully stable
G-Eval is an LLM-as-a-judge metric that scores outputs against custom criteria by first deriving evaluation steps and then judging the output against those steps. DeepEval metrics docs
This is DeepEval’s most flexible metric. If your evaluation question is not covered by a narrow built-in metric, G-Eval is often the first tool you reach for.
from deepeval.metrics import GEval
from deepeval.test_case import SingleTurnParams

correctness = GEval(
    name="Correctness",
    criteria="Is the actual output factually correct given the expected output?",
    evaluation_params=[SingleTurnParams.ACTUAL_OUTPUT, SingleTurnParams.EXPECTED_OUTPUT],
    threshold=0.5,
)
DeepEval explicitly ties this metric to the research paper G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In that paper, Liu et al. report a Spearman correlation of 0.514 with human judgment on summarization. The same paper also warns that LLM judges show bias toward LLM-generated text. arXiv, ACL Anthology
That is the honest catch in DeepEval as a whole. Most of its metrics are judge-based. So if you treat a single score as ground truth, you can build false confidence into your release process.
If you want a fuller treatment of that problem, see LLM-as-a-Judge: Bias, Leniency & the LoCoMo Number. It is the right companion piece because DeepEval makes judge-based evaluation operational, but it does not make judge variance disappear.
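One way to ground that caution is to compute the same statistic the paper reports, Spearman correlation, between judge scores and human ratings on a small labelled sample of your own. The sketch below is plain Python with no DeepEval dependency. The helper names and the five score pairs are hypothetical, chosen only to show the mechanics.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up sample: judge scores in [0, 1] and human ratings on a 1-5 scale.
judge_scores = [0.9, 0.8, 0.7, 0.4, 0.6]
human_ratings = [5, 3, 4, 2, 1]

print(round(spearman(judge_scores, human_ratings), 3))
```

A low correlation on your own data is a signal to keep that metric advisory rather than letting it gate a release.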
DeepEval DAG metric: the deterministic escape hatch
DAG is DeepEval’s “Deep Acyclic Graph” metric: a deterministic decision-tree evaluation where the final score comes from the path through explicit judgment nodes, not from a judge model directly assigning a number. DAG docs
This is DeepEval’s most important idea after the pytest-style harness.
Instead of asking a model to emit one broad score for “quality,” you decompose the judgment into smaller decisions through node types like TaskNode, BinaryJudgementNode, NonBinaryJudgementNode, and VerdictNode. The tree is traversed in order, and the score is tied to that path. DAG docs
Concretely, a check on that refund reply might decompose like this:
[TaskNode: analyze the support reply]
  │
  ▼
[BinaryJudgementNode: is the refund policy mentioned?]
  ├── Yes ──> [BinaryJudgementNode: is the 30-day timeline accurate?]
  │             ├── Yes ──> [VerdictNode: score 1.0]
  │             └── No ──> [VerdictNode: score 0.3]
  └── No ──> [NonBinaryJudgementNode: is an alternative offered?]
                ├── Exchange ──> [VerdictNode: score 0.6]
                └── None ──> [VerdictNode: score 0.0]
The model still answers at each node, but it only ever makes one small, explicit decision at a time. The number comes from where the input lands in the tree, not from a judge weighing everything at once.
When to reach for which:
| | G-Eval | DAG |
|---|---|---|
| Authoring | flexible, fast to write | structured, more design work |
| Where the score comes from | the judge names a number | the path through the tree |
| Judge variance | more room for it | narrowed to one decision per node |
| Best for | open-ended criteria, quick coverage | CI gating where a single score must be trusted |
So DAG is not “purely deterministic” in the sense of removing all model judgment. It is deterministic in how the evaluation is structured and scored once the node logic is defined. That makes it a better fit when a single top-level score feels too soft for CI gating.
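The "score comes from the path" idea can be shown with a small plain-Python tree. This is a sketch of the concept only, not DeepEval's DAG API: Verdict, Judgement, and evaluate are hypothetical names, and the judge model is replaced by a canned lookup so the example runs without any API key.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union


@dataclass
class Verdict:
    """Leaf node: carries the final score for any input that lands here."""
    score: float


@dataclass
class Judgement:
    """Inner node: one narrow question, with a child per allowed answer."""
    question: str
    children: Dict[str, "Node"]


Node = Union[Verdict, Judgement]


def evaluate(node: Node, answer_fn: Callable[[str], str]) -> Tuple[float, List[Tuple[str, str]]]:
    """Walk the tree, asking answer_fn one question per node.

    answer_fn stands in for the judge model. The returned score is fixed
    by the path taken, never assigned directly by the judge.
    """
    path = []
    while isinstance(node, Judgement):
        label = answer_fn(node.question)
        path.append((node.question, label))
        node = node.children[label]
    return node.score, path


# The refund-reply tree from the diagram above.
tree = Judgement(
    "Is the refund policy mentioned?",
    {
        "yes": Judgement(
            "Is the 30-day timeline accurate?",
            {"yes": Verdict(1.0), "no": Verdict(0.3)},
        ),
        "no": Judgement(
            "Is an alternative offered?",
            {"exchange": Verdict(0.6), "none": Verdict(0.0)},
        ),
    },
)

# Canned answers in place of a real judge call.
canned = {
    "Is the refund policy mentioned?": "yes",
    "Is the 30-day timeline accurate?": "no",
}
score, path = evaluate(tree, canned.__getitem__)
print(score, path)
```

Because the path is recorded, a failing score comes with the exact sequence of decisions that produced it, which is easier to audit than a single number.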
DeepEval RAG triad: finding whether retrieval or generation broke
The RAG case is where DeepEval feels most practical.
DeepEval’s RAG triad consists of answer relevancy, faithfulness, and contextual relevancy. Together they tell you whether a bad answer came from the generator, the retriever, or both. RAG triad guide
Here is the split:
- Answer relevancy asks whether the output answers the user’s question and stays on topic.
- Faithfulness asks whether every claim in the answer is grounded in the retrieved context.
- Contextual relevancy asks whether the retriever fetched the right context in the first place. RAG triad guide
That division is more valuable than one aggregate “RAG score.” If contextual relevancy is weak, your retriever likely missed. If contextual relevancy is fine but faithfulness is weak, the generator likely drifted beyond evidence.
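That reading of the three scores can be written down as a small triage helper. The sketch below is plain Python under stated assumptions: diagnose_rag is a hypothetical name, the single shared threshold of 0.5 is a simplification, and the messages are illustrative rather than anything DeepEval emits.

```python
def diagnose_rag(answer_relevancy, faithfulness, contextual_relevancy, threshold=0.5):
    """Map the three triad scores to the component most likely at fault."""
    suspects = []
    if contextual_relevancy < threshold:
        suspects.append("retriever: fetched context looks off-target")
    if faithfulness < threshold:
        suspects.append("generator: claims go beyond the retrieved context")
    if answer_relevancy < threshold:
        suspects.append("generator: answer does not address the question")
    return suspects or ["no metric below threshold"]


# Context is fine, but the answer drifts beyond it: points at the generator.
print(diagnose_rag(answer_relevancy=0.9, faithfulness=0.3, contextual_relevancy=0.8))

# Weak context: points at the retriever first.
print(diagnose_rag(answer_relevancy=0.9, faithfulness=0.8, contextual_relevancy=0.2))
```

When contextual relevancy and faithfulness are both low, it usually makes sense to fix retrieval first, since a generator cannot stay grounded in context that was never fetched.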
This is one reason DeepEval sits above classic string-overlap evaluation. Metrics like BLEU or ROUGE can be useful in narrow cases, but they do not explain retriever-versus-generator failure in a grounded QA pipeline. For that contrast, see Hugging Face Evaluate: The One-Liner Everyone Gets Wrong.
DeepEval in CI: why the pytest analogy holds
The pytest analogy is not branding fluff. It maps to actual usage.
You define a test, attach metrics, and assert on them:
from deepeval import assert_test

# test_case and correctness are the objects defined in the earlier snippets.
def test_refund_answer():
    assert_test(test_case, [correctness])
Then you run the suite:
deepeval test run test_refund.py
DeepEval also supports evaluate() for batch scoring outside the unit-test harness. getting started, CI docs
This matters operationally. Teams often know they should evaluate, but the evals live in ad hoc scripts, notebooks, or dashboards detached from shipping workflows. DeepEval moves the check into the same control plane engineers already trust: the build.
That does not solve the epistemic problem of “is this metric valid?” But it does solve the engineering problem of “did we re-run the check, and can a regression stop deployment?”
Metric coverage and judge-model limits
DeepEval documents more than 50 built-in metrics across custom, RAG, agentic, multi-turn, safety, and multimodal families. That includes G-Eval and DAG, retriever metrics like contextual precision, recall, and relevancy, generator metrics like answer relevancy and faithfulness, agentic metrics like task completion and tool correctness, and safety metrics such as bias, toxicity, PII leakage, and misuse. metrics intro
That breadth is useful, but it comes with a cost: most of the interesting metrics require a judge model. DeepEval’s docs point to a default OPENAI_API_KEY path and also support custom models. They also note that statistical scorers such as BLEU, ROUGE, and BLEURT can run without a judge, but are weak when outputs require reasoning. metrics intro
So the framework’s real question is not “can it produce a score?” It can. The question is which scores deserve to gate CI in your application, and which scores should stay advisory.
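One way to make the gating-versus-advisory split explicit is to keep it as data next to the tests. The sketch below is plain Python and independent of DeepEval: the GATING and ADVISORY policies, the thresholds, and the helper names are all hypothetical, and the scores are made up.

```python
# Hypothetical policy: metric name -> minimum acceptable score.
GATING = {"Faithfulness": 0.7}        # a miss here should fail the build
ADVISORY = {"Answer Relevancy": 0.5}  # a miss here is only reported


def below(scores, policy):
    """Names of metrics whose score falls under the policy's minimum."""
    return [name for name, minimum in policy.items() if scores[name] < minimum]


def ci_exit_code(scores):
    """Return 1 when any gating metric misses its minimum, else 0."""
    for name in below(scores, ADVISORY):
        print(f"warning: {name} below advisory threshold")
    failures = below(scores, GATING)
    for name in failures:
        print(f"error: {name} below gating threshold")
    return 1 if failures else 0


# Made-up scores: the advisory metric misses, the gating metric passes.
scores = {"Faithfulness": 0.82, "Answer Relevancy": 0.41}
print(ci_exit_code(scores))
```

Promoting a metric from advisory to gating then becomes a one-line, reviewable change rather than an implicit decision buried in a test file.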
OSS DeepEval vs Confident AI platform
DeepEval is the open-source library. It runs standalone and does not require a platform account. GitHub
Confident AI is the optional paid platform on top, covering things like datasets, tracing, and production monitoring. DeepTeam is a sibling package for red-teaming and safety. Synthetic dataset generation through the Synthesizer is part of the OSS library. Confident AI
That distinction is worth keeping clear because the open-source eval harness is already useful on its own.
Common questions
What is DeepEval?
DeepEval is an open-source LLM evaluation framework by Confident AI, positioned as “Pytest for LLMs.” It lets you define test cases, attach metrics, and run them in CI so quality regressions fail builds. GitHub, docs
How does DeepEval work in CI?
You write a test with assert_test(test_case, [metrics]) and run it with deepeval test run. Each metric produces a score from 0 to 1 and a reason, and carries a threshold that defaults to 0.5. If the score falls below the threshold, the test fails. getting started, CI docs, metrics intro
What is G-Eval in DeepEval?
G-Eval is DeepEval’s flexible LLM-as-a-judge metric for custom criteria. It uses a chain-of-thought style two-step process to derive evaluation steps from your criteria and then score the output against those steps. DeepEval metrics docs
What is the DAG metric in DeepEval?
DAG, which DeepEval names “Deep Acyclic Graph,” is a deterministic decision-tree metric built from explicit node types. The score comes from the path taken through the tree rather than from a judge model directly assigning a number. DAG docs
What is the DeepEval RAG triad?
The RAG triad is answer relevancy, faithfulness, and contextual relevancy. Together they separate generator problems from retriever problems by checking whether the answer is on-topic, grounded in retrieved context, and supported by the right retrieved documents. RAG triad guide
Is DeepEval the same as Confident AI?
No. DeepEval is the open-source Apache 2.0 library that runs standalone. Confident AI is the optional paid platform on top, and DeepTeam is a separate sibling package for red-teaming and safety. GitHub, Confident AI
Related
- LLM-as-a-Judge: Bias, Leniency & the LoCoMo Number
- LangChain & LangSmith Evaluation: The Memory Blind Spot
- Hugging Face Evaluate: The One-Liner Everyone Gets Wrong
- How to Evaluate AI Agent Memory
DeepEval helps you test what a model did on a case, turn, or pipeline step. It does not by itself answer the longer-horizon question of whether an agent remembered the right thing across sessions. That is a separate evaluation problem, and it is where persistent-memory systems become relevant.
