
Evaluation Benchmarks

Comprehensive test scenarios with fixed acceptance criteria and a reproducible evaluation harness.

Scenarios

  • Retrieval@K: measure NDCG/MRR on a fixed corpus using render requests (see the metric sketch after this list)
  • Latency: measure p50/p95 across adapters (HTTP, MCP) with warm and cold cache
  • Privacy: trigger redact/block paths; assert zero leaks; measure overhead
  • Robustness: inject retriable errors; verify recovery policy effectiveness
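
For the retrieval scenario, here is a minimal sketch of how NDCG@K and MRR@K can be computed from a ranked result list, assuming binary relevance labels derived from the gold corpus (function and variable names are illustrative, not part of the harness):

```ts
// Illustrative metric helpers; `relevance` is 1/0 per ranked result,
// in the order returned by the adapter.
function dcgAtK(relevance: number[], k: number): number {
  return relevance
    .slice(0, k)
    .reduce((sum, rel, i) => sum + rel / Math.log2(i + 2), 0);
}

function ndcgAtK(relevance: number[], k: number): number {
  const ideal = [...relevance].sort((a, b) => b - a); // best possible ordering
  const idcg = dcgAtK(ideal, k);
  return idcg === 0 ? 0 : dcgAtK(relevance, k) / idcg;
}

function mrrAtK(relevance: number[], k: number): number {
  const firstHit = relevance.slice(0, k).findIndex((rel) => rel > 0);
  return firstHit === -1 ? 0 : 1 / (firstHit + 1);
}

// Example: first relevant document at rank 2
console.log(ndcgAtK([0, 1, 1, 0, 0], 10), mrrAtK([0, 1, 1, 0, 0], 10));
```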

Harness (v0 outline)

  • Input: JSONL corpus (see README) and config.yaml (adapter endpoint, headers)
  • Runner: Node script or Python CLI issuing POST /render requests or MCP commands (see the runner sketch after this list)
  • Timing: capture start/stop timestamps at the client; include orchestration timings if provided
  • Seeds: fixed query order; any random jitter is seeded for reproducibility
  • Artifacts: per‑query CSV results; summary JSON; logs with request_id
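
A minimal sketch of the runner loop under these assumptions; the JSONL field names, the /render request body, and the output path are illustrative placeholders, not the actual contract (see the README for the real schema):

```ts
import { readFileSync, writeFileSync } from "node:fs";

// Hypothetical corpus line shape; adjust to the JSONL schema in the README.
interface CorpusEntry { id: string; query: string; expected_ids: string[] }

async function runBenchmark(endpoint: string, corpusPath: string) {
  const entries: CorpusEntry[] = readFileSync(corpusPath, "utf8")
    .split("\n")
    .filter(Boolean)
    .map((line) => JSON.parse(line));

  const results: unknown[] = [];
  for (const entry of entries) {            // fixed order for reproducibility
    const start = performance.now();
    const res = await fetch(`${endpoint}/render`, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ query: entry.query }), // placeholder request shape
    });
    const body = await res.json();
    results.push({
      id: entry.id,
      request_id: res.headers.get("x-request-id"), // for log correlation
      latency_ms: performance.now() - start,       // client-side timing
      status: res.status,
      body,
    });
  }
  writeFileSync("out/results.jsonl", results.map((r) => JSON.stringify(r)).join("\n"));
}
```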

Report

  • CSV columns: id, p@5, p@10, ndcg@10, mrr@10, cov_entities, p50_ms, p95_ms, retries, errors, privacy_leak
  • Summary JSON: see metrics.md. Compare against ENV thresholds (a gating sketch follows this list):
    • EVAL_NDCG10_MIN, EVAL_MRR10_MIN (optional), EVAL_P95_MAX_MS, EVAL_ERROR_RATE_MAX
  • Markdown snapshot: a short table with PASS/FAIL against the acceptance criteria
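
A minimal sketch of the threshold comparison, assuming the summary JSON exposes ndcg10, mrr10, p95_ms, and error_rate fields (names illustrative; see metrics.md for the actual schema):

```ts
// Read thresholds from the environment; unset vars are treated as "not enforced".
const thresholds = {
  ndcg10Min: Number(process.env.EVAL_NDCG10_MIN ?? NaN),
  mrr10Min: Number(process.env.EVAL_MRR10_MIN ?? NaN),   // optional
  p95MaxMs: Number(process.env.EVAL_P95_MAX_MS ?? NaN),
  errorRateMax: Number(process.env.EVAL_ERROR_RATE_MAX ?? NaN),
};

interface Summary { ndcg10: number; mrr10: number; p95_ms: number; error_rate: number }

function evaluateGate(summary: Summary): { pass: boolean; failures: string[] } {
  const failures: string[] = [];
  if (!Number.isNaN(thresholds.ndcg10Min) && summary.ndcg10 < thresholds.ndcg10Min)
    failures.push(`ndcg@10 ${summary.ndcg10} < ${thresholds.ndcg10Min}`);
  if (!Number.isNaN(thresholds.mrr10Min) && summary.mrr10 < thresholds.mrr10Min)
    failures.push(`mrr@10 ${summary.mrr10} < ${thresholds.mrr10Min}`);
  if (!Number.isNaN(thresholds.p95MaxMs) && summary.p95_ms > thresholds.p95MaxMs)
    failures.push(`p95 ${summary.p95_ms} ms > ${thresholds.p95MaxMs} ms`);
  if (!Number.isNaN(thresholds.errorRateMax) && summary.error_rate > thresholds.errorRateMax)
    failures.push(`error rate ${summary.error_rate} > ${thresholds.errorRateMax}`);
  return { pass: failures.length === 0, failures };
}
```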

Computation helper:

  • Use the evaluation CLI to aggregate results into the per-query CSV and the overall summary JSON (see README/metrics); a minimal aggregation sketch follows.
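
If the CLI is unavailable, aggregation reduces to percentile and mean computations over the per-query rows. A minimal sketch using nearest-rank percentiles; the row field names are illustrative assumptions:

```ts
// Nearest-rank percentile over per-query latencies (assumes a non-empty input).
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Aggregate per-query rows into the overall summary (field names illustrative).
function summarize(rows: { ndcg10: number; mrr10: number; latency_ms: number; errors: number }[]) {
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  return {
    ndcg10: mean(rows.map((r) => r.ndcg10)),
    mrr10: mean(rows.map((r) => r.mrr10)),
    p50_ms: percentile(rows.map((r) => r.latency_ms), 50),
    p95_ms: percentile(rows.map((r) => r.latency_ms), 95),
    error_rate: rows.filter((r) => r.errors > 0).length / rows.length,
  };
}
```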

Baselines (comfort targets)

  • Quality: NDCG@10 ≥ 0.60; MRR@10 ≥ 0.55
  • Latency: p50 ≤ 300 ms; p95 ≤ 800 ms (adapter inclusive)
  • Privacy: 0 leaks across suite

Local Execution

Prerequisites:

```bash
# Validate contract alignment
npm run contracts:validate
# Expected: "All contracts validated successfully"
```

Execution Steps:

```bash
# 1. Start your development endpoint
npm run dev:start

# 2. Run the benchmark suite
npm run benchmark:run -- --endpoint=http://localhost:3000 --corpus=test/gold.jsonl

# 3. Generate reports
npm run benchmark:report -- --results=out/results.jsonl --output=out/reports/
# Outputs: out/reports/metrics.csv, out/reports/summary.json, out/reports/report.md
```

Regression Gate (0.1)

  • Compare the current summary.json to the previous baseline artifact from main
  • Fail the PR if degradation exceeds epsilon or any threshold is violated (see the sketch below)
  • Keep a CHANGELOG note with metric deltas and suspected root causes
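
A minimal sketch of the epsilon comparison against the baseline artifact; the metric names and the epsilon value are illustrative assumptions, not project settings:

```ts
// Compare the current summary against the baseline from main.
// "Higher is better" metrics regress when they drop; latency/error metrics when they rise.
const EPSILON = 0.02; // illustrative per-metric tolerance

interface Summary { ndcg10: number; mrr10: number; p95_ms: number; error_rate: number }

function findRegressions(current: Summary, baseline: Summary): string[] {
  const regressions: string[] = [];
  if (baseline.ndcg10 - current.ndcg10 > EPSILON)
    regressions.push(`ndcg@10 dropped ${(baseline.ndcg10 - current.ndcg10).toFixed(3)}`);
  if (baseline.mrr10 - current.mrr10 > EPSILON)
    regressions.push(`mrr@10 dropped ${(baseline.mrr10 - current.mrr10).toFixed(3)}`);
  if (current.p95_ms > baseline.p95_ms * (1 + EPSILON))
    regressions.push(`p95 rose from ${baseline.p95_ms} ms to ${current.p95_ms} ms`);
  if (current.error_rate - baseline.error_rate > EPSILON)
    regressions.push(`error rate rose ${(current.error_rate - baseline.error_rate).toFixed(3)}`);
  return regressions; // a non-empty list fails the PR; record deltas in the CHANGELOG
}
```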