---
title: Evaluation Metrics
description: Precise definitions and computation methods for all quality, operational, and privacy metrics.
priority: 0.9
lastmod: 2025-09-06
type: specification
status: published
---

# Evaluation Metrics

Canonical definitions for all metrics used in continuous system evaluation and improvement.

## Quality

- Precision@K: relevant_returned_at_K / K
- Recall@K: relevant_returned_at_K / total_relevant
- NDCG@K: DCG@K / IDCG@K, gain = 1 for expected_ids; log2 discount
- MRR@K: 1 / rank_of_first_relevant (0 if none in top K)
- Coverage_entities: fraction of entity tags present in retrieved context

### Notes

- expected_ids are canonical fragment IDs (global:..., layer:...)
- multiple relevant items are permitted; ties break by original ranking
- Groundedness (LLM-as-judge): optional in v0.1; requires calibration against a human-annotated subset (≥ 100 samples)
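
As a reference for the four ranking metrics above, here is a minimal Python sketch (binary gain, log2 discount, 1-based ranks); `ranked_ids` and `expected_ids` are illustrative parameter names, not part of the spec:

```python
from math import log2

def precision_at_k(ranked_ids, expected_ids, k):
    """Precision@K: relevant items in the top K, divided by K."""
    return sum(1 for i in ranked_ids[:k] if i in expected_ids) / k

def recall_at_k(ranked_ids, expected_ids, k):
    """Recall@K: relevant items in the top K, divided by total relevant."""
    return sum(1 for i in ranked_ids[:k] if i in expected_ids) / len(expected_ids)

def ndcg_at_k(ranked_ids, expected_ids, k):
    """NDCG@K with binary gain (1 for expected_ids) and log2 discount."""
    dcg = sum(1 / log2(rank + 2)  # rank 0 gets discount log2(2) = 1
              for rank, i in enumerate(ranked_ids[:k]) if i in expected_ids)
    ideal_hits = min(len(expected_ids), k)
    idcg = sum(1 / log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

def mrr_at_k(ranked_ids, expected_ids, k):
    """MRR@K: reciprocal rank of the first relevant item; 0 if none in top K."""
    for rank, i in enumerate(ranked_ids[:k], start=1):
        if i in expected_ids:
            return 1 / rank
    return 0.0
```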

## Operational

- Latency p50/p95/p99 (ms): end‑to‑end adapter → orchestration → reply
- CPU/Memory: process‑level samples; optional, not a hard gate in v0
- Error rate: 4xx/5xx per 1k requests; exclude client aborts
- Timeouts: fraction exceeding soft_deadline_ms (see contracts)
- Retries: count and reasons (from retry catalog)
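
A sketch of computing the latency percentiles and error rate from raw samples; it assumes client aborts are logged as HTTP 499, which is an assumption of this example rather than part of the spec:

```python
import statistics

def latency_percentiles(latencies_ms: list[float]) -> dict:
    """p50/p95/p99 over end-to-end latency samples (ms)."""
    qs = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50_ms": qs[49], "p95_ms": qs[94], "p99_ms": qs[98]}

def error_rate_per_1k(status_codes: list[int]) -> float:
    """4xx/5xx per 1k requests; client aborts (assumed logged as 499) excluded."""
    counted = [s for s in status_codes if s != 499]  # drop client aborts (assumption)
    errors = sum(1 for s in counted if 400 <= s < 600)
    return 1000 * errors / len(counted) if counted else 0.0
```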

### Deterministic evaluation requirements

- Fix random seeds for all stochastic components
- Freeze corpus versions (use a git commit hash or version tag)
- Pin model versions and prompt templates
- For network variance: run ≥ 3 iterations and report median values
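
A hypothetical illustration of pinning run conditions; the config fields mirror the list above, and the fingerprint hash is a suggested convenience, not a spec requirement:

```python
import hashlib
import json
import random
import statistics

def freeze_eval_config(seed: int, corpus_ref: str, model: str, prompt_ref: str) -> dict:
    """Pin everything that may vary between runs and fingerprint it for the report."""
    random.seed(seed)  # extend to numpy/torch seeds if those components are in play
    config = {
        "seed": seed,
        "corpus_ref": corpus_ref,  # git commit hash or version tag
        "model": model,
        "prompt_ref": prompt_ref,  # pinned prompt template identifier
    }
    config["fingerprint"] = hashlib.sha256(
        json.dumps(config, sort_keys=True).encode()
    ).hexdigest()[:12]
    return config

def median_of_runs(run_once, iterations: int = 3) -> float:
    """Run >= 3 iterations and report the median to damp network variance."""
    return statistics.median(run_once() for _ in range(iterations))
```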

## Privacy

- Redaction efficacy: 1 − leaked_tokens / sensitive_tokens
- Block efficacy: 1 if no sensitive emitted, else 0
- Secure field bypass: fraction of inputs correctly bypassed
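
The first two formulas translate directly to code; returning 1.0 when no sensitive tokens are present is an assumed convention for the zero-denominator case:

```python
def redaction_efficacy(leaked_tokens: int, sensitive_tokens: int) -> float:
    """1 - leaked/sensitive; assumed 1.0 when nothing sensitive was present."""
    if sensitive_tokens == 0:
        return 1.0
    return 1.0 - leaked_tokens / sensitive_tokens

def block_efficacy(emitted_sensitive: bool) -> int:
    """Binary: 1 if no sensitive content was emitted, else 0."""
    return 0 if emitted_sensitive else 1
```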

## Mapping to Orchestration Metrics

Use fields from ../orchestration/metrics.md when available:

- request_id, adapter, surface
- timings: parse_ms, retrieval_ms, synthesis_ms, total_ms
- retry: attempts, last_reason
- privacy: mode, redacted, blocked
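
A sketch of projecting an orchestration metrics record onto these fields; the nested record shape is inferred from the field list above and may differ from the actual schema in ../orchestration/metrics.md:

```python
def from_orchestration_record(rec: dict) -> dict:
    """Project an orchestration metrics record onto the evaluation fields."""
    return {
        "request_id": rec["request_id"],
        "adapter": rec["adapter"],
        "surface": rec["surface"],
        "retrieval_ms": rec["timings"]["retrieval_ms"],
        "total_ms": rec["timings"]["total_ms"],
        "retries": rec["retry"]["attempts"],
        "privacy_mode": rec["privacy"]["mode"],
    }
```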

## Reporting

Per‑query row (CSV):

- id, p_at_5, p_at_10, ndcg_10, mrr_10, cov_entities, p50_ms, p95_ms, errors, retries, privacy_leak

Summary JSON:
```json
{
  "version": "v0",
  "quality": { "p@5": 0.55, "ndcg@10": 0.62, "mrr@10": 0.58 },
  "operational": { "p50_ms": 220, "p95_ms": 710, "error_rate": 0.004 },
  "privacy": { "leak_rate": 0.0 }
}
```

Tooling (v0):
- A lightweight CLI computes these metrics from JSONL inputs and writes `out/metrics.csv` and `out/summary.json`. See the README for workflow and usage.
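
This is not the CLI itself, but a minimal sketch of writing its two output artifacts, using the row and summary shapes defined above:

```python
import csv
import json
import pathlib

CSV_FIELDS = ["id", "p_at_5", "p_at_10", "ndcg_10", "mrr_10", "cov_entities",
              "p50_ms", "p95_ms", "errors", "retries", "privacy_leak"]

def write_reports(rows: list[dict], summary: dict, out_dir: str = "out") -> None:
    """Write out/metrics.csv (per-query rows) and out/summary.json (aggregates)."""
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    with (out / "metrics.csv").open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=CSV_FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    (out / "summary.json").write_text(json.dumps(summary, indent=2))
```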

## Metric Ownership (v0.1)

Required: each key metric must have a designated owner responsible for monitoring and improvement.

| Layer         | Metrics                        | Owner Role          |
|---------------|--------------------------------|---------------------|
| L1 Noosphere  | recall@10, ndcg@10             | Noosphere Tech Lead |
| L3 Workshop   | tool_success_rate, p95_latency | Tools Tech Lead     |
| Orchestration | end_to_end_p95, error_rate     | ACS/CEO Tech Lead   |

Owner responsibilities:
- Monitor metric SLA compliance weekly
- Investigate and resolve degradation within 48 hours
- Propose acceptance criteria updates quarterly