Evaluation Benchmarks
Comprehensive test scenarios with fixed acceptance criteria and a reproducible evaluation harness.
Scenarios
- Retrieval@K: measure NDCG/MRR on a fixed corpus using render requests (metric sketch after this list)
- Latency: measure p50/p95 across adapters (HTTP, MCP) with warm and cold cache
- Privacy: trigger redact/block paths; assert zero leaks; measure overhead
- Robustness: inject retriable errors; verify recovery policy effectiveness
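For the retrieval scenario, here is a minimal TypeScript sketch of the standard NDCG@10 and MRR@10 formulas; the graded-relevance input shape is an assumption for illustration, not the harness's actual data model.

```ts
// Minimal sketch of the standard metric formulas.
// `rels` is the graded relevance of each ranked result, in rank order --
// this input shape is an assumption, not the harness's data model.

/** DCG@k with the 2^rel - 1 gain used by most IR toolkits. */
function dcgAtK(rels: number[], k: number): number {
  return rels
    .slice(0, k)
    .reduce((sum, rel, i) => sum + (2 ** rel - 1) / Math.log2(i + 2), 0);
}

/** NDCG@k: DCG of the observed ranking over DCG of the ideal ranking. */
function ndcgAtK(rels: number[], k: number): number {
  const ideal = dcgAtK([...rels].sort((a, b) => b - a), k);
  return ideal === 0 ? 0 : dcgAtK(rels, k) / ideal;
}

/** MRR@k: reciprocal rank of the first relevant result, 0 if none in top k. */
function mrrAtK(rels: number[], k: number): number {
  const idx = rels.slice(0, k).findIndex((rel) => rel > 0);
  return idx === -1 ? 0 : 1 / (idx + 1);
}
```

For example, `ndcgAtK([1, 0, 2], 10)` ≈ 0.69 and `mrrAtK([1, 0, 2], 10)` = 1.0.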
Harness (v0 outline)
- Input: JSONL corpus (see README), config.yaml (adapter endpoint, headers)
- Runner: a Node script or Python CLI issuing POST /render or an MCP command (runner sketch after this list)
- Timing: capture start/stop timestamps at the client; include orchestration timings if provided
- Seeds: fixed query order; any random jitter is seeded for reproducibility
- Artifacts: CSV per-query results; summary JSON; logs with request_id
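A sketch of the runner loop under explicit assumptions: Node 18+ (global fetch), a `{ query }` request body, an optional `timings.total_ms` field in the response, and an `x-request-id` header; none of these shapes are confirmed by the contract.

```ts
// Sketch of the v0 runner loop. The /render request and response shapes,
// JSONL field names, and header names are assumptions drawn from this
// outline, not a confirmed API contract.
import { readFileSync, appendFileSync } from "node:fs";
import { randomUUID } from "node:crypto";

interface Query { id: string; query: string }

async function runOnce(endpoint: string, q: Query): Promise<void> {
  const requestId = randomUUID();
  const start = performance.now();
  const res = await fetch(`${endpoint}/render`, {
    method: "POST",
    headers: { "content-type": "application/json", "x-request-id": requestId },
    body: JSON.stringify({ query: q.query }),
  });
  const elapsedMs = performance.now() - start; // client-side wall clock
  const body = await res.json();
  // Prefer server-reported orchestration timings when the adapter provides them.
  const serverMs = body?.timings?.total_ms ?? null;
  appendFileSync(
    "out/results.jsonl",
    JSON.stringify({ id: q.id, request_id: requestId, status: res.status,
                     client_ms: elapsedMs, server_ms: serverMs }) + "\n",
  );
}

async function main(): Promise<void> {
  const corpus: Query[] = readFileSync("test/gold.jsonl", "utf8")
    .split("\n").filter(Boolean).map((line) => JSON.parse(line));
  for (const q of corpus) await runOnce("http://localhost:3000", q); // fixed order
}

main().catch((err) => { console.error(err); process.exit(1); });
```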
Report
- CSV columns: id, p@5, p@10, ndcg@10, mrr@10, cov_entities, p50_ms, p95_ms, retries, errors, privacy_leak
- Summary JSON: see metrics.md. Compare against ENV thresholds (gate sketch after this list):
  - EVAL_NDCG10_MIN, EVAL_MRR10_MIN (optional), EVAL_P95_MAX_MS, EVAL_ERROR_RATE_MAX
- Markdown snapshot: short table with PASS/FAIL vs acceptance criteria
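A sketch of the ENV-threshold comparison, assuming the summary field names `ndcg10`, `mrr10`, `p95_ms`, and `error_rate`; metrics.md holds the canonical schema.

```ts
// Sketch of the ENV-threshold gate. The summary.json field names are
// assumptions; see metrics.md for the canonical schema.
import { readFileSync } from "node:fs";

const s = JSON.parse(readFileSync("out/reports/summary.json", "utf8"));

// A threshold left unset in the environment is skipped (MRR is optional).
const checks: Array<[string, boolean]> = [
  ["EVAL_NDCG10_MIN", s.ndcg10 >= Number(process.env.EVAL_NDCG10_MIN)],
  ["EVAL_MRR10_MIN", s.mrr10 >= Number(process.env.EVAL_MRR10_MIN)],
  ["EVAL_P95_MAX_MS", s.p95_ms <= Number(process.env.EVAL_P95_MAX_MS)],
  ["EVAL_ERROR_RATE_MAX", s.error_rate <= Number(process.env.EVAL_ERROR_RATE_MAX)],
];

const failures = checks.filter(([name, ok]) => process.env[name] !== undefined && !ok);
for (const [name] of failures) console.error(`FAIL: ${name}`);
process.exit(failures.length > 0 ? 1 : 0);
```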
Computation helper:
- Use the evaluation CLI to aggregate the per-query CSV and produce the overall summary JSON (see README/metrics); a sketch of that aggregation follows.
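A sketch of what that aggregation plausibly computes, assuming per-query rows with `ndcg10`, `mrr10`, `latency_ms`, and `errors` fields and nearest-rank percentiles; the CLI's actual implementation may differ.

```ts
// Sketch of the aggregation the evaluation CLI performs: per-query rows in,
// one summary out. Row field names mirror the CSV columns above and are
// assumptions; the exact summary schema lives in metrics.md.
interface Row { ndcg10: number; mrr10: number; latency_ms: number; errors: number }

/** Nearest-rank percentile over a sorted copy of the values. */
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

function summarize(rows: Row[]) {
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const latencies = rows.map((r) => r.latency_ms);
  return {
    ndcg10: mean(rows.map((r) => r.ndcg10)),
    mrr10: mean(rows.map((r) => r.mrr10)),
    p50_ms: percentile(latencies, 50),
    p95_ms: percentile(latencies, 95),
    error_rate: rows.filter((r) => r.errors > 0).length / rows.length,
  };
}
```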
Baselines (comfort targets)
- Quality: NDCG@10 ≥ 0.60; MRR@10 ≥ 0.55
- Latency: p50 ≤ 300 ms; p95 ≤ 800 ms (adapter inclusive)
- Privacy: 0 leaks across the suite
Local Execution
Prerequisites:
```bash
# Validate contracts alignment
npm run contracts:validate
# Expected: "All contracts validated successfully"
```
Execution Steps:
```bash
# 1. Start your development endpoint
npm run dev:start

# 2. Run benchmark suite
npm run benchmark:run -- --endpoint=http://localhost:3000 --corpus=test/gold.jsonl

# 3. Generate reports
npm run benchmark:report -- --results=out/results.jsonl --output=out/reports/

# Outputs: out/reports/metrics.csv, out/reports/summary.json, out/reports/report.md
```
Regression Gate (0.1)
- Compare the current summary.json to the previous baseline artifact from main
- Fail the PR if degradation exceeds epsilon or thresholds are violated (comparison sketch after this list)
- Keep a CHANGELOG note with metric deltas and suspected root causes
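A sketch of the gate's comparison step; the field names, file paths, and per-metric epsilon values are illustrative assumptions, not a fixed contract.

```ts
// Sketch of the 0.1 regression gate: compare the current summary to the
// baseline artifact from main. Field names, paths, and epsilons below are
// assumptions chosen to illustrate the comparison.
import { readFileSync } from "node:fs";

type Summary = { ndcg10: number; mrr10: number; p95_ms: number };

const load = (path: string): Summary => JSON.parse(readFileSync(path, "utf8"));
const baseline = load("baseline/summary.json");
const current = load("out/reports/summary.json");

// Quality may not drop by more than epsilon; latency may not grow by more.
const EPS_QUALITY = 0.02;   // absolute NDCG/MRR points
const EPS_LATENCY_MS = 50;  // absolute milliseconds

const regressions: string[] = [];
if (baseline.ndcg10 - current.ndcg10 > EPS_QUALITY) regressions.push("ndcg10");
if (baseline.mrr10 - current.mrr10 > EPS_QUALITY) regressions.push("mrr10");
if (current.p95_ms - baseline.p95_ms > EPS_LATENCY_MS) regressions.push("p95_ms");

if (regressions.length > 0) {
  console.error(`Regression beyond epsilon: ${regressions.join(", ")}`);
  process.exit(1); // fail the PR check
}
console.log("Regression gate passed.");
```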