---
title: Evaluation Metrics
description: Precise definitions and computation methods for all quality, operational, and privacy metrics.
priority: 0.9
lastmod: 2025-09-06
type: specification
status: published
---
# Evaluation Metrics
Canonical definitions for all metrics used in continuous system evaluation and improvement.
## Quality
- Precision@K: relevant_returned_at_K / K
- Recall@K: relevant_returned_at_K / total_relevant
- NDCG@K: DCG@K / IDCG@K, gain=1 for expected_ids; log2 discount
- MRR@K: 1 / rank_of_first_relevant (0 if none in top K)
- Coverage_entities: fraction of entity tags present in retrieved context
### Notes
- expected_ids are canonical fragment IDs (global:..., layer:...)
- multiple relevant items permitted; tie‑break by original ranking
- Groundedness (LLM-as-judge): optional in v0.1; requires calibration against a human-annotated subset (≥ 100 samples)
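A minimal reference sketch of the quality computations above, in Python. The formulas follow the definitions exactly; the substring match in `coverage_entities` is an assumption, since the spec does not fix a matching strategy for entity tags:
```python
from math import log2

def precision_at_k(ranked_ids, expected_ids, k):
    # relevant_returned_at_K / K
    return sum(1 for fid in ranked_ids[:k] if fid in expected_ids) / k

def recall_at_k(ranked_ids, expected_ids, k):
    # relevant_returned_at_K / total_relevant
    if not expected_ids:
        return 0.0
    return sum(1 for fid in ranked_ids[:k] if fid in expected_ids) / len(expected_ids)

def ndcg_at_k(ranked_ids, expected_ids, k):
    # DCG@K / IDCG@K, gain=1 for expected_ids; log2 discount on 1-based rank
    dcg = sum(1.0 / log2(rank + 1)
              for rank, fid in enumerate(ranked_ids[:k], start=1)
              if fid in expected_ids)
    idcg = sum(1.0 / log2(rank + 1)
               for rank in range(1, min(len(expected_ids), k) + 1))
    return dcg / idcg if idcg else 0.0

def mrr_at_k(ranked_ids, expected_ids, k):
    # 1 / rank_of_first_relevant; 0 if none in top K
    for rank, fid in enumerate(ranked_ids[:k], start=1):
        if fid in expected_ids:
            return 1.0 / rank
    return 0.0

def coverage_entities(entity_tags, retrieved_context):
    # fraction of entity tags present in retrieved context
    # (plain substring matching is an assumption; substitute your tagger's matcher)
    if not entity_tags:
        return 1.0
    return sum(1 for tag in entity_tags if tag in retrieved_context) / len(entity_tags)
```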
## Operational
- Latency p50/p95/p99 (ms): end‑to‑end adapter → orchestration → reply
- CPU/Memory: process‑level samples; optional, not a hard gate in v0
- Error rate: 4xx/5xx per 1k requests; exclude client aborts
- Timeouts: fraction exceeding soft_deadline_ms (see contracts)
- Retries: count and reasons (from retry catalog)
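As a non-normative illustration of these counters (the nearest-rank percentile method is an assumption, as the spec does not fix an interpolation scheme, and client aborts are assumed to be filtered out before `error_rate` is called):
```python
import math

def percentile(samples_ms, q):
    # nearest-rank percentile (q in (0, 100]) over per-request latencies
    xs = sorted(samples_ms)
    if not xs:
        return 0.0
    rank = max(math.ceil(q / 100 * len(xs)), 1)
    return xs[rank - 1]

def error_rate(status_codes):
    # 4xx/5xx per 1k requests; client aborts excluded upstream
    if not status_codes:
        return 0.0
    return 1000 * sum(1 for s in status_codes if s >= 400) / len(status_codes)

def timeout_fraction(total_ms_samples, soft_deadline_ms):
    # fraction of requests exceeding soft_deadline_ms (see contracts)
    if not total_ms_samples:
        return 0.0
    return sum(1 for t in total_ms_samples if t > soft_deadline_ms) / len(total_ms_samples)
```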
### Deterministic evaluation requirements
- Fix random seeds for all stochastic components
- Freeze corpus versions (use git commit hash or version tag)
- Pin model versions and prompt templates
- For network variance: run ≥3 iterations, report median values
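A small sketch of what pinning a run might look like; `pin_run` and `median_of_runs` are hypothetical helper names, not part of any tooling described here:
```python
import random
from statistics import median

def pin_run(seed, corpus_version, model_version, prompt_template_id):
    # fix seeds for all stochastic components (extend for numpy/torch if present)
    random.seed(seed)
    # record the frozen inputs alongside results so the run is reproducible
    return {
        "seed": seed,
        "corpus_version": corpus_version,  # git commit hash or version tag
        "model_version": model_version,
        "prompt_template": prompt_template_id,
    }

def median_of_runs(run_once, iterations=3):
    # absorb network variance: run >= 3 iterations, report the median
    return median(run_once() for _ in range(iterations))
```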
## Privacy
- Redaction efficacy: 1 − leaked_tokens / sensitive_tokens
- Block efficacy: 1 if no sensitive emitted, else 0
- Secure field bypass: fraction of inputs correctly bypassed
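Sketches of the three privacy metrics as defined above; the edge-case conventions (zero sensitive tokens counts as perfect efficacy, zero secure inputs counts as full bypass) are assumptions not fixed by the spec:
```python
def redaction_efficacy(leaked_tokens, sensitive_tokens):
    # 1 - leaked_tokens / sensitive_tokens
    # (treating zero sensitive tokens as perfect efficacy is an assumption)
    if sensitive_tokens == 0:
        return 1.0
    return 1.0 - leaked_tokens / sensitive_tokens

def block_efficacy(sensitive_emitted_count):
    # 1 if no sensitive content was emitted, else 0
    return 1 if sensitive_emitted_count == 0 else 0

def secure_field_bypass(bypassed_correctly, total_secure_inputs):
    # fraction of inputs correctly bypassed
    if total_secure_inputs == 0:
        return 1.0
    return bypassed_correctly / total_secure_inputs
```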
## Mapping to Orchestration Metrics
Use fields from ../orchestration/metrics.md when available:
- request_id, adapter, surface
- timings: parse_ms, retrieval_ms, synthesis_ms, total_ms
- retry: attempts, last_reason
- privacy: mode, redacted, blocked
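A sketch of projecting an orchestration record onto evaluation fields. The field names come from the list above; the defaults and the `eval_fields` helper itself are assumptions:
```python
def eval_fields(rec):
    # pull evaluation-relevant fields out of an orchestration metrics record;
    # names follow ../orchestration/metrics.md, the fallback defaults are assumptions
    timings = rec.get("timings", {})
    return {
        "id": rec["request_id"],
        "adapter": rec.get("adapter"),
        "surface": rec.get("surface"),
        "parse_ms": timings.get("parse_ms"),
        "retrieval_ms": timings.get("retrieval_ms"),
        "synthesis_ms": timings.get("synthesis_ms"),
        "total_ms": timings.get("total_ms"),
        "retries": rec.get("retry", {}).get("attempts", 0),
        "retry_reason": rec.get("retry", {}).get("last_reason"),
        "privacy_mode": rec.get("privacy", {}).get("mode"),
    }
```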
## Reporting
Per‑query row (CSV):
- id, p_at_5, p_at_10, ndcg_10, mrr_10, cov_entities, p50_ms, p95_ms, errors, retries, privacy_leak
Summary JSON:
```json
{
  "version": "v0",
  "quality": { "p@5": 0.55, "ndcg@10": 0.62, "mrr@10": 0.58 },
  "operational": { "p50_ms": 220, "p95_ms": 710, "error_rate": 0.004 },
  "privacy": { "leak_rate": 0.0 }
}
```
Tooling (v0):
- A lightweight CLI computes these metrics from JSONL inputs and writes `out/metrics.csv` and `out/summary.json`. See README for workflow and usage.
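For illustration, a minimal writer producing the two artifacts the CLI emits; this is a sketch of the output contract above, not the CLI's actual implementation:
```python
import csv
import json
from pathlib import Path

CSV_FIELDS = ["id", "p_at_5", "p_at_10", "ndcg_10", "mrr_10", "cov_entities",
              "p50_ms", "p95_ms", "errors", "retries", "privacy_leak"]

def write_reports(rows, summary, out_dir="out"):
    # write the per-query CSV and summary JSON described in Reporting
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with (out / "metrics.csv").open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=CSV_FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    (out / "summary.json").write_text(json.dumps(summary, indent=2))
```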
## Metric Ownership (v0.1)
Required: each key metric must have a designated owner responsible for monitoring and improvement.
| Layer | Metrics | Owner Role |
|----------------|----------------------------------|-----------------------|
| L1 Noosphere | recall@10, ndcg@10 | Noosphere Tech Lead |
| L3 Workshop | tool_success_rate, p95_latency | Tools Tech Lead |
| Orchestration | end_to_end_p95, error_rate | ACS/CEO Tech Lead |
Owner Responsibilities:
- Monitor metric SLA compliance weekly
- Investigate and resolve degradation within 48 hours
- Propose acceptance criteria updates quarterly