Evaluation Benchmarks
Comprehensive test scenarios with fixed acceptance criteria and a reproducible evaluation harness.
Scenarios
- Retrieval@K: measure NDCG/MRR on a fixed corpus using render requests (metric sketch after this list)
- Latency: measure p50/p95 across adapters (HTTP, MCP) with warm and cold cache
- Privacy: trigger redact/block paths; assert zero leaks; measure overhead
- Robustness: inject retriable errors; verify recovery policy effectiveness
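For the retrieval scenario, here is a minimal TypeScript sketch of the standard NDCG@10 and MRR@10 formulas; the graded-relevance input shape is an assumption for illustration, not the harness's actual data model.

```ts
// Minimal sketch of the standard metric formulas.
// `rels` is the graded relevance of each ranked result, in rank order --
// this input shape is an assumption, not the harness's data model.

/** DCG@k with the 2^rel - 1 gain used by most IR toolkits. */
function dcgAtK(rels: number[], k: number): number {
  return rels
    .slice(0, k)
    .reduce((sum, rel, i) => sum + (2 ** rel - 1) / Math.log2(i + 2), 0);
}

/** NDCG@k: DCG of the observed ranking over DCG of the ideal ranking. */
function ndcgAtK(rels: number[], k: number): number {
  const ideal = dcgAtK([...rels].sort((a, b) => b - a), k);
  return ideal === 0 ? 0 : dcgAtK(rels, k) / ideal;
}

/** MRR@k: reciprocal rank of the first relevant result, 0 if none in top k. */
function mrrAtK(rels: number[], k: number): number {
  const idx = rels.slice(0, k).findIndex((rel) => rel > 0);
  return idx === -1 ? 0 : 1 / (idx + 1);
}
```

For example, `ndcgAtK([1, 0, 2], 10)` ≈ 0.69 and `mrrAtK([1, 0, 2], 10)` = 1.0.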
Harness (v0 outline)
- Input: JSONL corpus (see README), config.yaml (adapter endpoint, headers)
- Runner: a Node script or Python CLI issuing POST /render or an MCP command (runner sketch after this list)
- Timing: capture start/stop timestamps at the client; include orchestration timings if provided
- Seeds: fixed query order; any random jitter is seeded for reproducibility
- Artifacts: CSV per-query results; summary JSON; logs with request_id
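A sketch of the runner loop under explicit assumptions: Node 18+ (global fetch), a `{ query }` request body, an optional `timings.total_ms` field in the response, and an `x-request-id` header; none of these shapes are confirmed by the contract.

```ts
// Sketch of the v0 runner loop. The /render request and response shapes,
// JSONL field names, and header names are assumptions drawn from this
// outline, not a confirmed API contract.
import { readFileSync, appendFileSync } from "node:fs";
import { randomUUID } from "node:crypto";

interface Query { id: string; query: string }

async function runOnce(endpoint: string, q: Query): Promise<void> {
  const requestId = randomUUID();
  const start = performance.now();
  const res = await fetch(`${endpoint}/render`, {
    method: "POST",
    headers: { "content-type": "application/json", "x-request-id": requestId },
    body: JSON.stringify({ query: q.query }),
  });
  const elapsedMs = performance.now() - start; // client-side wall clock
  const body = await res.json();
  // Prefer server-reported orchestration timings when the adapter provides them.
  const serverMs = body?.timings?.total_ms ?? null;
  appendFileSync(
    "out/results.jsonl",
    JSON.stringify({ id: q.id, request_id: requestId, status: res.status,
                     client_ms: elapsedMs, server_ms: serverMs }) + "\n",
  );
}

async function main(): Promise<void> {
  const corpus: Query[] = readFileSync("test/gold.jsonl", "utf8")
    .split("\n").filter(Boolean).map((line) => JSON.parse(line));
  for (const q of corpus) await runOnce("http://localhost:3000", q); // fixed order
}

main().catch((err) => { console.error(err); process.exit(1); });
```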
Report
- CSV columns: id, p@5, p@10, ndcg@10, mrr@10, cov_entities, p50_ms, p95_ms, retries, errors, privacy_leak
- Summary JSON: see metrics.md. Compare against ENV thresholds (gate sketch after this list):
  - EVAL_NDCG10_MIN, EVAL_MRR10_MIN (optional), EVAL_P95_MAX_MS, EVAL_ERROR_RATE_MAX
- Markdown snapshot: short table with PASS/FAIL vs acceptance criteria
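A sketch of the ENV-threshold comparison, assuming the summary field names `ndcg10`, `mrr10`, `p95_ms`, and `error_rate`; metrics.md holds the canonical schema.

```ts
// Sketch of the ENV-threshold gate. The summary.json field names are
// assumptions; see metrics.md for the canonical schema.
import { readFileSync } from "node:fs";

const s = JSON.parse(readFileSync("out/reports/summary.json", "utf8"));

// A threshold left unset in the environment is skipped (MRR is optional).
const checks: Array<[string, boolean]> = [
  ["EVAL_NDCG10_MIN", s.ndcg10 >= Number(process.env.EVAL_NDCG10_MIN)],
  ["EVAL_MRR10_MIN", s.mrr10 >= Number(process.env.EVAL_MRR10_MIN)],
  ["EVAL_P95_MAX_MS", s.p95_ms <= Number(process.env.EVAL_P95_MAX_MS)],
  ["EVAL_ERROR_RATE_MAX", s.error_rate <= Number(process.env.EVAL_ERROR_RATE_MAX)],
];

const failures = checks.filter(([name, ok]) => process.env[name] !== undefined && !ok);
for (const [name] of failures) console.error(`FAIL: ${name}`);
process.exit(failures.length > 0 ? 1 : 0);
```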
Computation helper:
- Use the evaluation CLI to aggregate the per-query CSV and produce the overall summary JSON (see README/metrics); a sketch of that aggregation follows.
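A sketch of what that aggregation plausibly computes, assuming per-query rows with `ndcg10`, `mrr10`, `latency_ms`, and `errors` fields and nearest-rank percentiles; the CLI's actual implementation may differ.

```ts
// Sketch of the aggregation the evaluation CLI performs: per-query rows in,
// one summary out. Row field names mirror the CSV columns above and are
// assumptions; the exact summary schema lives in metrics.md.
interface Row { ndcg10: number; mrr10: number; latency_ms: number; errors: number }

/** Nearest-rank percentile over a sorted copy of the values. */
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

function summarize(rows: Row[]) {
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const latencies = rows.map((r) => r.latency_ms);
  return {
    ndcg10: mean(rows.map((r) => r.ndcg10)),
    mrr10: mean(rows.map((r) => r.mrr10)),
    p50_ms: percentile(latencies, 50),
    p95_ms: percentile(latencies, 95),
    error_rate: rows.filter((r) => r.errors > 0).length / rows.length,
  };
}
```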
Baselines (comfort targets)
- Quality: NDCG@10 ≥ 0.60; MRR@10 ≥ 0.55
- Latency: p50 ≤ 300 ms; p95 ≤ 800 ms (adapter inclusive)
- Privacy: 0 leaks across the suite
Local Execution
Prerequisites:
```bash
# Validate contracts alignment
npm run contracts:validate
# Expected: "All contracts validated successfully"
```
Execution Steps:
```bash
# 1. Start your development endpoint
npm run dev:start

# 2. Run benchmark suite
npm run benchmark:run -- --endpoint=http://localhost:3000 --corpus=test/gold.jsonl

# 3. Generate reports
npm run benchmark:report -- --results=out/results.jsonl --output=out/reports/

# Outputs: out/reports/metrics.csv, out/reports/summary.json, out/reports/report.md
```
Regression Gate (0.1)
- Compare the current summary.json to the previous baseline artifact from main
- Fail the PR if degradation exceeds epsilon or thresholds are violated (comparison sketch after this list)
- Keep a CHANGELOG note with metric deltas and suspected root causes
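A sketch of the gate's comparison step; the field names, file paths, and per-metric epsilon values are illustrative assumptions, not a fixed contract.

```ts
// Sketch of the 0.1 regression gate: compare the current summary to the
// baseline artifact from main. Field names, paths, and epsilons below are
// assumptions chosen to illustrate the comparison.
import { readFileSync } from "node:fs";

type Summary = { ndcg10: number; mrr10: number; p95_ms: number };

const load = (path: string): Summary => JSON.parse(readFileSync(path, "utf8"));
const baseline = load("baseline/summary.json");
const current = load("out/reports/summary.json");

// Quality may not drop by more than epsilon; latency may not grow by more.
const EPS_QUALITY = 0.02;   // absolute NDCG/MRR points
const EPS_LATENCY_MS = 50;  // absolute milliseconds

const regressions: string[] = [];
if (baseline.ndcg10 - current.ndcg10 > EPS_QUALITY) regressions.push("ndcg10");
if (baseline.mrr10 - current.mrr10 > EPS_QUALITY) regressions.push("mrr10");
if (current.p95_ms - baseline.p95_ms > EPS_LATENCY_MS) regressions.push("p95_ms");

if (regressions.length > 0) {
  console.error(`Regression beyond epsilon: ${regressions.join(", ")}`);
  process.exit(1); // fail the PR check
}
console.log("Regression gate passed.");
```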