LangChain / LangSmith Evaluation: What It Measures — and the One Thing It Can't

You can no longer ship an LLM agent on vibes. But every dashboard you'll reach for — LangSmith, RAGAS, Braintrust — shares one blind spot: it scores an output, never whether your agent remembers. This is a practitioner's tour of what LangSmith evaluation genuinely does well, where the LLM-as-judge underneath it wobbles, and the measurement problem none of these tools touch.

TL;DR

  • LangSmith evaluation answers "is this response good?" well, with datasets, examples, experiments, offline/online evaluators, and annotation queues. But it scores an output, never whether your agent remembers.
  • The LLM-as-judge underneath most quality scores is shakier than the number implies: MT-Bench documents position bias, verbosity bias, and self-preference; a single absolute LLM score is the weakest signal in your pipeline.
  • RAGAS, DeepEval, TruLens, Braintrust, and OpenAI Evals are all useful — and all scoped to outputs; memory quality is a property of behavior across many sessions, not one run.
  • Memory is measured by separate benchmarks (LoCoMo, LongMemEval, BEAM), but there is no neutral leaderboard — read every memory score with its config, date, and limits in hand.

Three questions get routinely conflated. Keep them apart:

| Question | Unit measured | Tool that answers it |
| --- | --- | --- |
| Is this response correct/helpful? | one output | LangSmith, RAGAS, DeepEval |
| Is the retrieval grounded? | query + context | RAGAS, TruLens |
| Does the agent remember across sessions? | behavior over time | memory benchmarks (LoCoMo / LongMemEval / BEAM) |

The first two are well-served. The third is where most agents now live or die — and almost nothing measures it.

What LangSmith genuinely does well

LangSmith's model rests on three primitives (docs): a dataset of examples, an example (inputs + optional reference outputs), and an experiment (one app version run over a dataset, capturing outputs, scores, and traces). It splits offline evaluation — datasets, regression tests, backtesting against historical inputs — from online evaluation that scores production traffic in near real-time (types).
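To make the three primitives concrete, here is a minimal sketch in plain Python. The names `Example`, `Dataset`, `ExperimentResult`, `run_experiment`, `exact_match`, and `echo_app` are illustrative assumptions, not LangSmith's SDK types, and traces are left out for brevity. The sketch only mirrors the relationships described above.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Example:
    inputs: dict[str, Any]
    reference_outputs: Optional[dict[str, Any]] = None  # optional, as in the text


@dataclass
class Dataset:
    name: str
    examples: list[Example] = field(default_factory=list)


@dataclass
class ExperimentResult:
    app_version: str
    outputs: list[dict[str, Any]]
    scores: list[float]


def run_experiment(app: Callable, app_version: str, dataset: Dataset,
                   scorer: Callable) -> ExperimentResult:
    """Run one app version over every example and score each output."""
    outputs = [app(ex.inputs) for ex in dataset.examples]
    scores = [scorer(ex, out) for ex, out in zip(dataset.examples, outputs)]
    return ExperimentResult(app_version, outputs, scores)


def exact_match(example: Example, output: dict) -> float:
    """Hypothetical scorer: 1.0 when the output equals the reference."""
    return float(example.reference_outputs == output)


def echo_app(inputs: dict) -> dict:
    """Hypothetical app under test: upper-cases the question."""
    return {"answer": inputs["question"].upper()}


dataset = Dataset("toy", [
    Example({"question": "ping"}, {"answer": "PING"}),
    Example({"question": "pong"}, {"answer": "nope"}),
])
result = run_experiment(echo_app, "v1", dataset, exact_match)
print(result.scores)  # [1.0, 0.0]
```

Note that every score here attaches to a single output, which is the scope the rest of this article keeps returning to.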

Evaluators come in several shapes: LLM-as-judge, deterministic code checks, pairwise, and summary statistics, driven by the evaluate() API in Python and TypeScript (quickstart):

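```python
from langsmith import Client

client = Client()  # reads LANGSMITH_API_KEY from the environment

# my_app, correctness, and conciseness are placeholders you define yourself:
# my_app maps an example's inputs to outputs; each evaluator scores one run.
results = client.evaluate(
    my_app,                              # the function under test
    data="my-dataset",                   # name of an existing dataset
    evaluators=[correctness, conciseness],
)
```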
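The snippet above leaves `correctness` and `conciseness` undefined. As a rough sketch of the "deterministic code check" shape, they could be plain functions like the ones below. The parameter names (`inputs`, `outputs`, `reference_outputs`), the `"answer"` key, and the 50-word budget are assumptions for illustration; check the SDK documentation for the evaluator signature your version accepts.

```python
def correctness(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    """Deterministic check: case-insensitive exact match on the answer."""
    got = outputs.get("answer", "").strip().lower()
    want = reference_outputs.get("answer", "").strip().lower()
    return got == want


def conciseness(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    """Deterministic check: the answer stays within a word budget."""
    max_words = 50  # arbitrary budget chosen for this sketch
    return len(outputs.get("answer", "").split()) <= max_words


# Standalone usage, no LangSmith account needed.
example_inputs = {"question": "What is the capital of France?"}
example_reference = {"answer": "Paris"}

print(correctness(example_inputs, {"answer": " paris "}, example_reference))     # True
print(conciseness(example_inputs, {"answer": "word " * 80}, example_reference))  # False
```

Code checks like these are cheap and repeatable, which makes them a useful anchor next to any LLM-graded score.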

For human review, annotation queues — single and pairwise — route runs to reviewers and feed their corrections back into datasets (docs). It's framework-agnostic, ships open-source evaluator libraries (OpenEvals, AgentEvals), and prices accessibly: a free Developer tier, $39/seat Plus, custom Enterprise (pricing).

This is a mature, honest answer to "is this response good?" Hold that scope in mind: a response.

The LLM-as-judge underneath is shakier than the number implies

Most "quality scores" are an LLM grading an LLM, and that judge is less reliable than a clean number suggests. The foundational MT-Bench study found strong judges agree with humans over 80% of the time — as often as humans agree with each other. But the same paper documents the failure modes: position bias (even GPT-4 gave the same verdict after swapping answer order only ~65% of the time), verbosity bias (a longer-but-emptier answer fooled weaker judges 91.3% of the time), and self-preference (judges appear to lean toward their own model family). And determinism doesn't rescue it — "Can You Trust LLM Judgments?" shows a fixed seed and low temperature improve consistency but don't guarantee reliability.
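Position bias is easy to measure on your own judge before you trust it. The sketch below computes a consistency rate: the share of answer pairs where the judge picks the same winner after the two answers swap places. The `Judge` callable shape and both toy judges are assumptions made for illustration; in practice the judge would wrap a real model call.

```python
from typing import Callable

# (question, first_answer, second_answer) -> "first" or "second"
Judge = Callable[[str, str, str], str]


def position_consistency(judge: Judge, pairs: list[tuple[str, str, str]]) -> float:
    """Fraction of pairs where the judge picks the same answer in both orders."""
    stable = 0
    for question, a, b in pairs:
        original = judge(question, a, b)   # a is shown first
        swapped = judge(question, b, a)    # b is shown first
        winner_original = a if original == "first" else b
        winner_swapped = b if swapped == "first" else a
        stable += winner_original == winner_swapped
    return stable / len(pairs)


def first_slot_judge(question: str, first: str, second: str) -> str:
    """Toy judge with maximal position bias: always prefers the first slot."""
    return "first"


def longer_answer_judge(question: str, first: str, second: str) -> str:
    """Toy judge with verbosity bias but no position bias."""
    return "first" if len(first) >= len(second) else "second"


pairs = [
    ("q1", "short", "a much longer answer"),
    ("q2", "alpha beta", "gamma"),
]
print(position_consistency(first_slot_judge, pairs))     # 0.0
print(position_consistency(longer_answer_judge, pairs))  # 1.0
```

The second toy judge scores a perfect 1.0 while still being wrong in a different way, a reminder that passing one bias check says nothing about the others.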

The mitigations are unglamorous: randomize answer order, ensemble several judges, calibrate against a human-labeled set, keep temperature low. They bound the error; they don't erase it. A single absolute LLM score is the weakest signal in your pipeline.
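Three of those mitigations (random answer order, an ensemble of judges, calibration against human labels) can be sketched in a few lines. Everything here is hypothetical scaffolding: the three toy judges stand in for real model calls, and the two labeled pairs stand in for a human-labeled calibration set.

```python
import random
from collections import Counter


def judge_with_random_order(judge, question, a, b, rng):
    """Show the answers in random order, then map the verdict back to 'a' or 'b'."""
    if rng.random() < 0.5:
        return "a" if judge(question, a, b) == "first" else "b"
    return "b" if judge(question, b, a) == "first" else "a"


def ensemble_verdict(judges, question, a, b, rng):
    """Majority vote; an odd number of judges and two options means no ties."""
    votes = Counter(judge_with_random_order(j, question, a, b, rng) for j in judges)
    return votes.most_common(1)[0][0]


def agreement_with_humans(judges, labeled_pairs, rng):
    """Calibration: share of pairs where the ensemble matches the human label."""
    hits = sum(
        ensemble_verdict(judges, q, a, b, rng) == human
        for q, a, b, human in labeled_pairs
    )
    return hits / len(labeled_pairs)


# Toy judges, each with its own blind spot.
def prefers_longer(q, first, second):
    return "first" if len(first) > len(second) else "second"

def prefers_shorter(q, first, second):
    return "first" if len(first) < len(second) else "second"

def prefers_keyword(q, first, second):
    return "first" if "Paris" in first else "second"


labeled_pairs = [
    ("Capital of France?", "Paris", "It might be Lyon, I think", "a"),
    ("Capital of France?", "Berlin", "The capital is Paris", "b"),
]
rng = random.Random(0)  # fixed seed so the run is repeatable
judges = [prefers_longer, prefers_shorter, prefers_keyword]
print(agreement_with_humans(judges, labeled_pairs, rng))  # 1.0
```

The agreement figure is the number to report alongside any judge score: it tells a reader how far the judge can be trusted on data that looks like yours, and it only bounds the error on that data.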

The landscape, one honest line each

  • RAGAS — RAG-specific metrics (faithfulness, context precision/recall) + test-data generation.
  • DeepEval — "pytest for LLMs": unit-test-style assertions wired into CI.
  • TruLens — feedback functions and tracing, now under Snowflake.
  • Braintrust — a commercial eval + observability platform.
  • OpenAI Evals — the open-source framework is alive; OpenAI's hosted Evals product is being deprecated (shut down 2026-11-30, with Promptfoo as the migration path).

Useful tools, all of them. All scoped to outputs.

Why you can't evaluate a LangChain agent's memory with these tools

Here is what they share. They evaluate an output (a run, a trajectory, a thread) against reference data. That is the right unit for "did this response answer the question?" It is the wrong unit for "does this agent remember?"

Memory quality isn't a property of one run. It's a property of behavior across many sessions over time: does the agent recall a fact you mentioned three sessions ago, consolidate instead of duplicating, notice when new information contradicts what it stored, weight recent context correctly. LangSmith's documented scope — examples, runs, threads against reference outputs — doesn't describe cross-session persistence or contradiction handling. That is not a flaw in LangSmith; it is a different measurement problem, and it needs a different instrument.
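A toy harness shows why the unit of measurement changes. In the sketch below, facts are stated in early sessions and probed in a later one, so the score depends on what the memory did between sessions rather than on any single output. The two memory classes, the session data, and the `score` function are all hypothetical; real benchmarks are far richer than this.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UpdatingMemory:
    """Toy memory that overwrites a fact when a newer statement contradicts it."""
    facts: dict = field(default_factory=dict)

    def observe(self, key: str, value: str) -> None:
        self.facts[key] = value

    def recall(self, key: str) -> Optional[str]:
        return self.facts.get(key)


class StaleMemory(UpdatingMemory):
    """Toy memory that keeps the first value it saw and ignores corrections."""

    def observe(self, key: str, value: str) -> None:
        self.facts.setdefault(key, value)


# Each session is (statements made, probes asked). Probes arrive sessions later.
SESSIONS = [
    ([("home_city", "Lisbon")], []),
    ([("pet", "a cat named Miso")], []),
    ([("home_city", "Berlin")], []),  # contradicts session 1
    ([], [("home_city", "Berlin"), ("pet", "a cat named Miso")]),
]


def score(memory, sessions) -> float:
    """Replay sessions in order and return the share of probes answered correctly."""
    checks = []
    for statements, probes in sessions:
        for key, value in statements:
            memory.observe(key, value)
        for key, expected in probes:
            checks.append(memory.recall(key) == expected)
    return sum(checks) / len(checks)


good_accuracy = score(UpdatingMemory(), SESSIONS)
stale_accuracy = score(StaleMemory(), SESSIONS)
print(good_accuracy, stale_accuracy)  # 1.0 0.5
```

Both memories would pass a single-run check in session 1. Only the replay across sessions separates them.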

How agent memory evaluation actually works: LoCoMo, LongMemEval, BEAM

A small set of benchmarks tackles it, and their short history is a lesson in how hard honest measurement is.

LoCoMo (2024) built very long multi-session dialogues and became the default. It's now saturating: an independent analysis shows the conversations fit inside modern context windows, so a plain full-context baseline (~73%) can beat a purpose-built memory system (~68%). Worse, a Penfield Labs audit found ~6.4% of its answer key is wrong and that the judge accepted up to ~63% of deliberately wrong answers. Take any single LoCoMo number with caution.

LongMemEval (ICLR 2025) is more disciplined — 500 curated questions across recall, multi-session reasoning, temporal reasoning, knowledge updates, and abstention — and found commercial assistants drop ~30% in sustained-memory accuracy. BEAM (arXiv 2025) raises the ceiling: 100 conversations up to 10M tokens, testing 10 abilities including contradiction resolution and event ordering that the earlier benchmarks omit — and even 1M-token-context models struggle, so a bigger window is not the fix.

One caveat ties them together: there is no neutral leaderboard. Vendors self-report on their own harnesses with their own judges, so the numbers aren't comparable — one re-run moved one vendor's own score from ~66% to ~75% purely by fixing three implementation bugs in the harness — and judge variance alone can swing them further. Read every memory score with its config, date, and limits in hand.
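One limit worth computing before comparing two scores is plain sampling error. The sketch below uses the normal approximation to a binomial proportion. The 500-question count matches LongMemEval as described above, while the 70% accuracy is a made-up figure for illustration. It covers only sampling noise; judge variance and harness bugs come on top.

```python
import math


def binomial_margin(accuracy: float, n_questions: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an accuracy measured on n questions."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n_questions)


# Hypothetical score of 70% on a 500-question benchmark.
margin = binomial_margin(0.70, 500)
print(f"70.0% +/- {margin * 100:.1f} points")  # 70.0% +/- 4.0 points
```

Under these assumptions, two systems a couple of points apart on a 500-question benchmark are not clearly distinguishable from sampling noise alone.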

The takeaway

Evaluate the response and evaluate the memory — they are two measurements, not one. LangSmith (and RAGAS, DeepEval, TruLens, Braintrust) will tell you whether a response is good; that part is solved. They will not tell you whether your agent's memory is trustworthy next week.

That second measurement is the problem Mnemoverse works on directly — a persistent-memory engine whose job is recall, consolidation, and contradiction handling across sessions. How it measures up on the public memory benchmarks, limits and all, is on the benchmarks page.

Common questions

What does LangSmith evaluation actually measure? LangSmith evaluation scores whether a response is good. Its model rests on three primitives — a dataset of examples, an example (inputs plus optional reference outputs), and an experiment (one app version run over a dataset, capturing outputs, scores, and traces).

What is LLM-as-judge in LangSmith, and is it reliable? Most quality scores are an LLM grading an LLM. Strong judges agree with humans over 80% of the time, but the same MT-Bench study documents position bias, verbosity bias, and self-preference. A single absolute LLM score is the weakest signal in your pipeline.

How do I make an LLM-as-judge evaluator more trustworthy? The mitigations are unglamorous: randomize answer order, ensemble several judges, calibrate against a human-labeled set, and keep temperature low. They bound the error; they don't erase it. A fixed seed and low temperature improve consistency but don't guarantee reliability.

Can I evaluate a LangChain agent's memory with LangSmith? No. LangSmith's documented scope — examples, runs, threads against reference outputs — doesn't describe cross-session persistence or contradiction handling. Memory quality is a property of behavior across many sessions, not one run. It is a different measurement problem needing a different instrument.

How is agent memory evaluated instead? Separate benchmarks tackle it: LoCoMo (2024, now saturating and with answer-key issues), LongMemEval (ICLR 2025, 500 curated questions), and BEAM (arXiv 2025, up to 10M tokens, testing contradiction resolution and event ordering). There is no neutral leaderboard, so read each score with its config, date, and limits.

How does LangSmith compare to RAGAS, DeepEval, TruLens, and Braintrust? All are useful, and all scoped to outputs. RAGAS does RAG-specific metrics plus test-data generation, DeepEval is pytest-for-LLMs assertions in CI, TruLens does feedback functions and tracing, and Braintrust is a commercial eval and observability platform. None evaluate whether your agent remembers.

Edward Izgorodin, June 2026 — LinkedIn