KB Evaluation Plan
Goal
Evaluate whether building and maintaining a kb knowledge base is materially useful for real development work, and whether a split workflow works better:
- Agent A builds the product/codebase.
- Agent B maintains and refreshes the knowledge base.
The core hypothesis is that this produces better systems faster, with lower token cost, better requirement capture, and better recall of project knowledge than relying on the primary coding agent’s transient context alone.
Primary Question
After building a KB from scratch for the evaluation target repository, can kb answer important questions about the project accurately and usefully enough to justify the extra maintenance work?
Secondary Questions
- Does
kb initproduce a usable knowledge base from the current repo without manual surgery? - Does the resulting KB support both retrieval-style questions (
kb query) and synthesis-style questions (kb chat) across multiple topic areas? - Is the resulting graph store populated enough to plausibly improve retrieval and follow-up questioning?
- What is the cost of producing this KB in time, tokens, and operator effort?
- In a later comparison run, does a dedicated KB-maintenance agent improve outcomes versus a single-agent baseline?
Evaluation Target
Canonical external benchmark: the raylib C library — use suite raylib (its repo_url is defined in eval/suites/raylib.yaml; override with --repo only when needed).
Reasons: mature, well-documented, not kb itself (avoids evaluator familiarity bias), rich graph structure, stable upstream.
Kb self-check: suite kb (its repo_url is defined in eval/suites/kb.yaml; override with --repo only when needed). That is a product smoke test, not the primary raylib benchmark.
For day-to-day kb architecture work on your checkout, use --base dogfood (separate from disposable eval bases).
Base naming convention
| Base | Purpose | Lifetime |
|---|---|---|
| `<repo-leaf>-YYYY-MM-DD-HHmm` | Default disposable base from `eval-run.mjs`: same string as `~/.kb/evaluations/<run-name>/` (override with `--base`) | Ephemeral |
| `raylib` | Persistent agent comparison base — accumulates across tasks, never wiped | Permanent |
| `dogfood` | kb’s own architectural knowledge | Permanent |
The raylib base is the KB that a KB-backed agent would actually use during real development on a long-lived raylib tree. Do not reuse ephemeral eval run names for it — the compounding hypothesis requires the same base to persist across multiple task sessions.
Published docs location
Eval runs do not publish Jekyll output. We only capture init/query evidence artifacts.
Automated harvest (scripts/eval-run.mjs)
One runner drives all disposable-base harvests. No local target directory: repo URL resolves from suite YAML repo_url, with optional --repo override. Each run does a fresh snapshot clone under ~/.kb/evaluations/<run-name>/repo/; scratch JSON and the default artifact live under ~/.kb/evaluations/<run-name>/. Default --base for init / all equals <run-name> (e.g. raylib-2026-04-27-1303 = repo leaf + date + HHmm; same-minute collision adds -2, -3, …).
From the kb repo root (after pnpm run build):
| npm script | Maps to |
|---|---|
| `npm run eval:init -- --suite raylib …` | Clone suite `repo_url`, init, queries |
| `npm run eval:init -- --suite kb …` | Clone suite `repo_url`, init, queries |
| `npm run eval:init -- --suite generic --repo <git-url> …` | Generic suite requires explicit repo override, then clone/init/queries |
| `npm run eval:all …` | Same as `eval:init` (alias) |
| `npm run eval:query -- --suite raylib --base <existing> …` | No init: docs + graph + logs + 8× `kb query` (still clones repo for cwd) |
| `npm run eval:gen-doc` | `kb docs generate` smoke (introduction + howto) on `--base` (default `dogfood`); artifact under `~/.kb/evaluations/<run>/` |
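For example, a query-only harvest against a base already populated by an earlier init run might look like the sketch below; the base name is a placeholder for your existing disposable base.

```bash
# Query-only harvest: no init, just docs/graph/logs capture plus the eight kb queries.
npm run eval:query -- --suite raylib --base raylib-2026-04-27-1303
```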
Modes
- `init` (or legacy `all`) — Fresh clone → `kb init --non-interactive`, then metrics + eight queries.
- `query` — Fresh clone → same capture minus init; requires `--base` for an already-populated KB session.
Suites (--suite)
- `raylib` — Eight raylib-specific questions (this document).
- `kb` — Eight kb-repo / product questions (contributor dogfood).
- `generic` — Eight repo-neutral questions. Use with `--repo` for arbitrary upstreams.
Override questions with `--questions-file path.json` (JSON array of exactly eight strings) to lock a custom suite without forking the script.
Example
npm run eval:init -- --suite generic \
--repo https://github.com/raysan5/raylib.git \
--label raylib-upstream-smoke \
--auto-score
Options: `--clone-branch main`, `--clone-depth 1` (default shallow; use `0` for full history). The artifact records `run.clone_url`, `run.target_cwd`, `run.run_dir`, `run.run_name`.
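To lock a custom question set via `--questions-file` (described above), a sketch like the following works; the file name and question strings are placeholders, and the runner expects a JSON array of exactly eight strings.

```bash
# Write a custom eight-question suite, then point the runner at it.
cat > my-questions.json <<'EOF'
["Q1", "Q2", "Q3", "Q4", "Q5", "Q6", "Q7", "Q8"]
EOF
npm run eval:init -- --suite generic \
  --repo https://github.com/raysan5/raylib.git \
  --questions-file my-questions.json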
Artifacts
- Default path: `~/.kb/evaluations/<run-name>/artifact.json`. Override with `--out`.
- Rebuild artifact from existing scratch: `--skip-init --run-dir ~/.kb/evaluations/<run-name>/` (expects matching clone at `~/.kb/evaluations/<run-name>/repo/`).
- Automated artifacts may include extra `run` fields for traceability. Tools should treat unknown keys as forward-compatible metadata.
Docs generate smoke (scripts/eval-gen-doc.mjs)
- Same run root: `~/.kb/evaluations/<run-name>/` with `artifact.json` and `gen-doc.log`.
- Flow matches CLI: each scenario runs `docs generate --finalize` (draft + `awaiting_review`), optional `--reject-once "<feedback>"` (one LLM revision; writes `diff-introduction.txt` / `diff-howto.txt` when a patch is produced), then `docs generate --accept` to commit the SQLite document.
- Each finalized doc is also written as `export-introduction.md` and `export-howto.md` (SQLite body from `docs view --output json`). Open `README-exports.md` in that folder for absolute paths and a one-line `open`/`xdg-open` hint.
- Default `--base dogfood`. Optional `--skip-purge` to skip deleting prior eval-titled docs (ids derived from fixed `documentTitle` strings).
- Exit `1` only on hard failure; artifact `status` is `complete` when automated checks pass for both scenarios.
- Interactive parity: `kb chat` supports `/docs generate "<prompt>" …` (questionnaire + review loop) and `/facts` (same surface as `kb facts`).
Evaluation Design
This evaluation should be run at least twice against the same codebase snapshot or equivalent branch state:
- Baseline run:
  - Build the KB from scratch with the normal workflow.
  - No special second-agent KB-maintenance strategy beyond answering `kb init` questions accurately.
- Comparison run:
  - Repeat on a fresh disposable base after using the intended two-agent workflow.
  - Keep the question set, scoring rubric, and artifact schema identical.
Standard Procedure
Phase 1: Initialize a Fresh KB
From the target repo root (or a clone):
- Run: `kb init --base raylib-2026-04-27-1303 --non-interactive` (pick a fresh disposable name; `eval-run.mjs` generates this pattern automatically).
- Or interactively: start `kb`, then `/init --base <same>`.
- Let `kb init` complete all passes through `pass-graph`.
- Save the resulting run metadata.
Use a disposable base name that matches your eval run folder when using `eval-run.mjs`, or any unique name for manual runs.
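A minimal sketch of the non-interactive path; the target clone path and the base name mirror the examples above and are placeholders.

```bash
cd ~/raylib                                              # target repo root (illustrative path)
kb init --base raylib-2026-04-27-1303 --non-interactive  # fresh disposable base
kb logs                                                  # note the kb init run ID for Phase 2
```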
Phase 2: Capture Build Metrics
Collect:
- Base name
- Git branch and commit of `~/raylib/`
- Start/end timestamps
- `kb init` run ID from `kb logs`
- Total init duration
- Total init input tokens
- Total init output tokens
- Estimated init cost
- Number of documents created
- Graph entity count
- Graph relationship count
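A minimal capture sketch for the metrics above; `jq` usage is an assumption (any JSON tool works), and the artifact path only applies when the run was driven by `eval-run.mjs`.

```bash
git -C ~/raylib branch --show-current    # branch
git -C ~/raylib rev-parse HEAD           # commit
kb logs                                  # kb init run ID
# Document and graph counts are already recorded in the automated artifact:
jq '.run.init_result.graph_summary' ~/.kb/evaluations/<run-name>/artifact.json
```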
Phase 3: Evaluate Answer Quality
Run a fixed question set:
kb query "<question>" --base ci-raylib-<date> --output jsonkb chatagainst the same base (optional, marknot_capturedif skipped)
Questions adapted for raylib (see Canonical Question Set below).
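A minimal loop sketch for the query pass, assuming the eight questions are stored one per line in a local `questions.txt` (the file name is illustrative); the `q1.json` … `q8.json` names mirror the automated runner's scratch layout.

```bash
i=1
while IFS= read -r q; do
  kb query "$q" --base ci-raylib-<date> --output json > "q${i}.json"
  i=$((i + 1))
done < questions.txt
```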
Phase 4: Score the Results
Each question gets a rubric score on four axes. Use `--auto-score` with the eval runner when available.
Phase 5: Publish
kb publish jekyll --base ci-raylib-<date> --dir ~/raylib-kb-docs/ --apply
Canonical Question Set
Use these raylib-adapted questions for all runs. If revised, copy the old suite forward and record the change in the artifact.
- What is raylib for, and what are its main capabilities?
- How does raylib’s architecture work, including modules and platform support?
- How do I install and build raylib, including dependencies and build systems?
- What configuration options and compile flags does raylib support?
- How does raylib handle graphics backends and platform-specific rendering?
- What are the coding conventions and style guidelines for contributing to raylib?
- What are the main gotchas, constraints, and known limitations of raylib?
- What does the raylib roadmap say about future plans, and what is the recent version history?
Scoring Rubric
Score each answer on four axes from 0 to 4.
Correctness
- 4: Factually correct and grounded in the repo/KB.
- 3: Mostly correct with minor omissions.
- 2: Mixed; contains meaningful inaccuracies or unsupported inference.
- 1: Mostly wrong or misleading.
- 0: No useful answer.
Usefulness
- 4: Directly helps a developer act or understand the system.
- 3: Helpful but incomplete.
- 2: Some signal, but requires substantial follow-up.
- 1: Barely helpful.
- 0: Not helpful.
Specificity
- 4: Uses concrete repo-specific details, commands, or mechanisms.
- 3: Some concrete detail, but still generic in places.
- 2: Partly generic.
- 1: Mostly generic.
- 0: Purely generic or evasive.
Evidence Handling
- 4: Clearly constrained to evidence, acknowledges uncertainty appropriately.
- 3: Reasonably evidence-grounded.
- 2: Some speculation or weak grounding.
- 1: Strong speculation or unsupported claims.
- 0: No evidence discipline.
Aggregate Metrics
For each run, compute:
- Mean score per axis for `query`
- Mean score per axis for `chat`
- Combined mean score
- Pass rate where `correctness >= 3` and `usefulness >= 3`
- Coverage notes by topic area
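A minimal sketch of the computation over a finished artifact, assuming `jq` is available and the artifact follows the schema defined later in this document; only the query axis is shown, and chat and combined follow the same pattern.

```bash
jq '{
  mean_correctness: ([.query_evaluation[].scores.correctness] | add / length),
  mean_usefulness: ([.query_evaluation[].scores.usefulness] | add / length),
  mean_specificity: ([.query_evaluation[].scores.specificity] | add / length),
  mean_evidence_handling: ([.query_evaluation[].scores.evidence_handling] | add / length),
  pass_rate: (([.query_evaluation[]
                | select(.scores.correctness >= 3 and .scores.usefulness >= 3)] | length)
              / (.query_evaluation | length))
}' evaluation/runs/2026-04-19-raylib-baseline.json
```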
Success Thresholds
Treat a run as promising if all are true:
- `kb init` completes successfully on a fresh disposable base.
- The graph store is populated with non-zero entities and relationships.
- Combined pass rate is at least 70%.
- At least 6/8 questions score `correctness >= 3`.
- At least 6/8 questions score `usefulness >= 3`.
Treat the two-agent theory as supported only if the comparison run beats the baseline on at least one of:
- Better combined answer quality
- Lower total token cost
- Lower elapsed time
- Better requirement/process capture in qualitative notes
without causing a meaningful regression in the other categories.
Artifact Storage
Every run — even weak or partial ones — should still emit a JSON artifact so comparisons stay reproducible. The repo gitignores `evaluation/` by default, so these files are not part of normal commits unless you force-add them or change the ignore rules.
Filename convention: `evaluation/runs/YYYY-MM-DD-<label>.json`
Reference baseline (historical example path): `evaluation/runs/2026-04-19-raylib-baseline.json`
Artifact Format
Each artifact should include:
- Run metadata
- Init metrics
- KB state summary
- Full question set used
- Raw `kb query` outputs
- Raw `kb chat` outputs (or `not_captured`)
- Manual rubric scores
- Aggregate scores
- Qualitative notes
Required JSON Schema
Future agents should treat the JSON shape below as the canonical artifact format. The goal is repeatability across runs even when the agent does not have prior conversational context.
Required top-level fields
- `schema_version`
- `evaluation_plan`
- `run_label`
- `status`
- `created_at`
- `repository`
- `hypothesis`
- `run`
- `question_set`
- `query_evaluation`
- `chat_evaluation`
- `aggregate_scores`
- `qualitative_findings`
- `next_improvement_areas`
Field expectations
- `schema_version`: integer schema version, starting at `1`
- `evaluation_plan`: string path, usually `EVALUATION.md`
- `run_label`: short label like `raylib-baseline` or `raylib-compare-agent-b`
- `status`: `complete` or `partial`
- `created_at`: ISO-8601 timestamp
- `repository`: object with `name`, `branch`, `commit`
- `hypothesis`: short string describing what this run is testing
- `run`: object describing the concrete scenario execution
- `question_set`: ordered array of the exact questions used
- `query_evaluation`: ordered array with one item per question
- `chat_evaluation`: ordered array with one item per question, or an object with `status: "not_captured"` plus `notes` if chat was not captured
- `aggregate_scores`: computed summary metrics
- `qualitative_findings`: flat array of short observations
- `next_improvement_areas`: flat array of likely follow-up improvements
Required run object
The run object should contain:
- `base`
- `mode` (string describing capture style, e.g. `non_interactive_cli_init` or `query_only_harvest`)
- `commands`
- `init_result`
Recommended for automated runs (so comparisons stay attributable):
- `eval_mode`: `all` or `query` — whether `kb init` was executed in this capture
- `suite`: `raylib` | `kb` | `generic` (or custom label if using `--questions-file`)
- `target_cwd`: absolute path where `kb` commands ran
- `clone_url`: if the target was produced by `git clone`, the URL (else `null`)
- `publish_dir`: Jekyll site root if publish ran (else `null`)
- `workdir`: scratch directory holding `q1.json` … `q8.json` (safe to delete after archiving)
The init_result object should contain:
- `status`
- `written_docs`
- `written_doc_ids`
- `init_run_id`
- `init_run_id_note`
- `docs_list`
- `graph_summary`
If a field is unavailable, include it with `null` and explain why in a sibling `*_note` field when appropriate.
Required per-question shape
Each item in `query_evaluation` and `chat_evaluation` should contain:
- `question_id`
- `question`
- `result_count`
- `retrieval`
- `answer_excerpt`
- `provenance`
- `scores`
- `notes`
The `scores` object must contain:
- `correctness`
- `usefulness`
- `specificity`
- `evidence_handling`
Raw output capture
To make runs comparable and re-auditable, each per-question object should also include raw command output when practical:
- For query runs: add `raw_query_output`
- For chat runs: add `raw_chat_output`
These may be omitted only if the artifact is marked `partial`.
Canonical template
{
"schema_version": 1,
"evaluation_plan": "EVALUATION.md",
"run_label": "raylib-baseline",
"status": "complete",
"created_at": "2026-04-19T15:00:00-07:00",
"repository": {
"name": "raylib",
"branch": "master",
"commit": "<git-sha>"
},
"hypothesis": "<what this run is testing>",
"run": {
"base": "ci-raylib-20260419",
"mode": "non_interactive_init",
"commands": [
"kb init --base ci-raylib-20260419 --non-interactive",
"kb query '<question>' --base ci-raylib-20260419 --output json (x8)",
"kb publish jekyll --base ci-raylib-20260419 --dir ~/raylib-kb-docs/ --apply"
],
"init_result": {
"status": "accepted",
"written_docs": 0,
"written_doc_ids": [],
"init_run_id": null,
"init_run_id_note": null,
"docs_list": { "documents": [] },
"graph_summary": { "entities": 0, "relationships": 0 }
}
},
"question_set": [],
"query_evaluation": [],
"chat_evaluation": { "status": "not_captured", "notes": "" },
"aggregate_scores": {
"query": {
"mean_correctness": 0,
"mean_usefulness": 0,
"mean_specificity": 0,
"mean_evidence_handling": 0,
"pass_rate_correctness_and_usefulness_at_least_3": 0
},
"chat": {
"mean_correctness": null,
"mean_usefulness": null,
"mean_specificity": null,
"mean_evidence_handling": null,
"pass_rate_correctness_and_usefulness_at_least_3": null
},
"combined": {
"mean_correctness": 0,
"mean_usefulness": 0,
"mean_specificity": 0,
"mean_evidence_handling": 0,
"pass_rate_correctness_and_usefulness_at_least_3": 0
}
},
"qualitative_findings": [],
"next_improvement_areas": []
}
Authoring rule
Agents should not invent their own artifact shape for future runs. If the schema needs to change:
- Update `schema_version`
- Update this section in `EVALUATION.md`
- Note the schema change in the artifact itself
Comparison Guidance
When comparing two runs:
- Keep `~/raylib/` at the same git commit for both runs.
- Use fresh `ci-raylib-*` bases for both runs.
- Reuse the same question set and scoring rubric.
- Prefer the same evaluator, or multiple evaluators with normalized scoring notes.
- Compare both machine metrics and human judgment.
Threats to Validity
- Repo familiarity may leak into interview answers and inflate results.
- The evaluator may know the correct answers already.
- LLM/provider drift may change answer quality across days.
- A single run can overfit to a lucky or unlucky `init`.
- Query quality may differ from chat quality; both must be measured separately.
Current Baseline
The reference raylib baseline artifact is `evaluation/runs/2026-04-19-raylib-baseline.json`.
- Init: 14 docs, 404 entities, 470 relationships, $0.025, 170s
- Query pass rate: 0.50 (5/8 hybrid retrieval; 1 tokenization-empty miss on install/build query)
That artifact is the reference point for the next comparison run.
Eval Type 2: Agent Token Efficiency Comparison
This is a separate evaluation from the init/query quality eval above. It tests whether having a KB actually reduces token usage during real implementation work.
Hypothesis
A KB-backed agent uses fewer tokens per task than a raw agent — and that advantage compounds over a task sequence. Each task the KB-backed agent completes deposits new facts into the raylib base. Future tasks find those facts via kb query instead of re-reading source files. Per-task token cost decreases as the base densifies.
Base
Use --base raylib (the persistent base, not ci-*). This base must survive across task sessions. The compounding effect only manifests when the same base is reused.
kb use --default raylib # set once before the task sequence begins
Task sequence (canonical)
Run these in order against ~/raylib/. Both agents work on the same task; the agent cannot reuse prior-task code between runs.
- Implement a flappy bird game in `~/raylib/examples/games/flappy_bird.c`
- Add parallax scrolling background to the flappy bird game
- Add a high score counter that persists between runs
- Add sound effects using raylib’s audio API
Each task is self-contained enough to run independently (agent starts fresh each time) but thematically connected so KB submissions from earlier tasks are useful to later ones.
Protocol
Agent A (raw):
- No `kb` access
- Discovers context by reading `~/raylib/src/`, headers, examples, docs directly
- No submissions after the task
Agent B (KB-backed):
- Has `kb query` available, base `raylib`
- Required to run at least one `kb query` before writing code
- Required to `kb submit` at least one fact discovered during the task before finishing
- May read source files too, but should prefer KB for known facts
Measurement
Use codeburn to capture per-session token counts:
codeburn report --provider claude --format json > /tmp/codeburn-task-N.json
Capture before and after each task. The metric is tokens consumed per task, not total.
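A minimal per-task capture sketch, reusing the codeburn command above; the token field names inside codeburn's JSON output are not assumed here, so the per-task delta is computed offline from the two snapshots.

```bash
codeburn report --provider claude --format json > /tmp/codeburn-task-3-before.json
# ... agent under test performs task 3 ...
codeburn report --provider claude --format json > /tmp/codeburn-task-3-after.json
# Per-task tokens = after minus before for each token counter in the two snapshots.
```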
Artifact
Store results under evaluation/runs/agent-compare/YYYY-MM-DD-task-N-<agent>.json.
Each artifact records:
- `task_id`: 1–4
- `agent`: `raw` or `kb-backed`
- `base`: `null` or `raylib`
- `kb_queries_made`: count (0 for raw agent)
- `kb_submissions_made`: count (0 for raw agent)
- `codeburn_input_tokens`: from codeburn report
- `codeburn_output_tokens`: from codeburn report
- `codeburn_cost_usd`: from codeburn report
- `task_completed`: boolean
- `notes`: free text
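A minimal sketch of writing one such artifact from the shell, matching the field list above; the date in the filename and all values are illustrative placeholders, not real measurements.

```bash
mkdir -p evaluation/runs/agent-compare
cat > evaluation/runs/agent-compare/2026-04-27-task-1-kb-backed.json <<'EOF'
{
  "task_id": 1,
  "agent": "kb-backed",
  "base": "raylib",
  "kb_queries_made": 2,
  "kb_submissions_made": 1,
  "codeburn_input_tokens": 0,
  "codeburn_output_tokens": 0,
  "codeburn_cost_usd": 0,
  "task_completed": true,
  "notes": "placeholder values"
}
EOF
```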
Success criteria
The KB-backed agent hypothesis is supported if:
- Agent B’s per-task token cost is lower than Agent A’s by task 3 or 4
- Agent B’s token cost curve slopes downward across the 4-task sequence
- Agent A’s token cost curve is flat or rising
A single session with better task 1 performance does not confirm the hypothesis — the compounding effect is the signal.