# KB Init Pipeline
`kb init` bootstraps a knowledge base from a repo. It runs, in order:

- input collection — README-like docs plus an optional source-code crawl
- `document-facts` — deterministic sentence ingest from collected markdown into the `facts` table
- `code-facts` — per-file LLM extraction into `import_code` facts
- `import-docs` — one verbatim original SQLite doc per discovered markdown file
- `write` — persist docs; with `kb scan` this stage also plans/applies claim mutations
- `pass-graph` — runs only when enabled
- `code-graph` — deterministic AST indexing into `kg_*` tables

Use `kb scan` to refresh sources against an existing base.
## Input Collection
```mermaid
flowchart TD
    A[kb init] --> B[collectSourceFiles]
    A --> C[crawlSourceCode]
    B --> B1["Fixed candidates\n(README, CLAUDE, AGENTS, …)"]
    B --> B2["Top-level *.md files\n(up to 8 total)"]
    B1 & B2 --> D["sourceFiles\nRecord<path, content>"]
    C --> C1["Walk repo tree\n(skip node_modules, dist, .git, …)"]
    C1 --> C2["Collect *.ts *.tsx *.js *.py *.go …\n(up to 200 files, 400 chars/file)"]
    C2 --> E["codeFiles\nRecord<path, snippet>"]
    D & E --> F[InitContext]
```
`sourceFiles` — human-readable documentation files collected for `import-docs` (verbatim originals) and for `document-facts` prompts. `codeFiles` — structural index of source code, fed into `code-facts` extraction.
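The code-crawl budget can be sketched as follows. The constant values come from the Configuration Constants table below; the function name and exact cut-off behaviour are illustrative assumptions, not the real implementation:

```typescript
// Sketch of the crawl budget: cap file count, per-file snippet length,
// and total characters. Names are hypothetical stand-ins.
const SOURCE_CODE_PER_FILE_CHARS = 400;
const SOURCE_CODE_MAX_FILES = 200;
const SOURCE_CODE_MAX_TOTAL_CHARS = 60_000;

type CodeFiles = Record<string, string>;

function collectSnippets(files: Array<[path: string, content: string]>): CodeFiles {
  const out: CodeFiles = {};
  let total = 0;
  for (const [path, content] of files) {
    if (Object.keys(out).length >= SOURCE_CODE_MAX_FILES) break;
    const snippet = content.slice(0, SOURCE_CODE_PER_FILE_CHARS); // 400 chars/file
    if (total + snippet.length > SOURCE_CODE_MAX_TOTAL_CHARS) break;
    out[path] = snippet;
    total += snippet.length;
  }
  return out;
}
```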
## Init cycles
```mermaid
flowchart TD
    A[kb init] --> R[read-inputs]
    R --> MF[document-facts]
    MF --> CF[code-facts]
    CF --> IM[import-docs]
    IM --> W[write]
    W --> PG[pass-graph]
    PG --> CG[code-graph]
    MF --> MF1["Deterministic markdown\n→ facts import_doc"]
    CF --> CF1["Per-file LLM\n→ import_code facts"]
    IM --> IM1["One original doc\nper source file"]
    W --> W1["SQLite upsert\n+ scan planner"]
    PG --> PG1["Optional graph\nextract to SQLite"]
    CG --> CG1["AST indexing\n→ kg_* tables + semantic bridge"]
```
## Jekyll Routing

After `kb publish jekyll`, documents are routed by provenance:

| `is_original` | Collection | Directory |
|---|---|---|
| 1 | `original_docs` | `_original_docs/` |
| 0 | `autogenerated_docs` | `_autogenerated_docs/` |
Originals are frozen snapshots of source files (we do not rewrite or “correct” them in the KB pipeline). Autogenerated docs are the curated layer we refine for retrieval and demos; they may overlap originals on purpose.
README.md is excluded from both — it is the site homepage (docs/index.md).
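The routing rule plus the README exclusion can be sketched as a small helper; the function name and row shape here are hypothetical, only the directory names and the `is_original` flag come from the table above:

```typescript
// Illustrative routing: README.md is the site homepage and gets no collection.
function jekyllCollection(doc: { is_original: 0 | 1; path: string }): string | null {
  if (/(^|\/)README\.md$/i.test(doc.path)) return null; // homepage → docs/index.md
  return doc.is_original === 1 ? "_original_docs/" : "_autogenerated_docs/";
}
```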
## Title Conventions

| Context | Rule | Example |
|---|---|---|
| Autogenerated docs | Cap Every Word (`toTitleCase`) | "Project Overview" |
| Original/source docs | Basename as-is (`basenameTitle`) | "CLAUDE.md" |
| Jekyll filenames | Slugified basename, no extension | `claude-instructions.md` |
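A rough TypeScript rendering of the three rules. The real `toTitleCase` and `basenameTitle` helpers live in the KB codebase and may differ in detail; these are sketches:

```typescript
// Cap Every Word — autogenerated doc titles.
function toTitleCase(s: string): string {
  return s.replace(/\w\S*/g, (w) => w[0].toUpperCase() + w.slice(1));
}

// Basename as-is, extension included — original/source doc titles.
function basenameTitle(path: string): string {
  return path.split("/").pop() ?? path;
}

// Slugified basename, extension dropped — Jekyll filenames.
function slugifyBasename(path: string): string {
  const base = basenameTitle(path).replace(/\.[^.]+$/, "");
  return base.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-|-$/g, "");
}
```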
## Facts extracted from written documents
When the init pipeline (or any path using SqliteDocumentWriter) persists markdown documents, the writer indexes candidate facts from document bodies (deterministic sentence segmentation, length filters, and capped inserts into the facts table). That is incremental fact growth alongside init; see facts-architecture.md §2 / §7 for the full ingest model.
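A minimal sketch of that ingest step. The split rule, length bounds, and cap are illustrative assumptions; the real parameters are documented in facts-architecture.md:

```typescript
// Deterministic sentence segmentation → length filter → capped candidate list.
function extractCandidateFacts(
  body: string,
  opts = { minLen: 30, maxLen: 300, maxFacts: 50 }, // assumed bounds
): string[] {
  return body
    .split(/(?<=[.!?])\s+/)                    // naive sentence segmentation
    .map((s) => s.replace(/\s+/g, " ").trim()) // normalize whitespace
    .filter((s) => s.length >= opts.minLen && s.length <= opts.maxLen)
    .slice(0, opts.maxFacts);                  // capped inserts
}
```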
## Code-derived facts

The `code-facts` cycle (`src/core/code-fact-extract.ts`) makes source code a first-class fact source without mirroring the AST:
- Skeleton — a deterministic regex pass per file extracts top-level exports, imports, and the leading doc block. The skeleton is used only as prompt context and as the anchor namespace; it does not write fact rows.
- LLM semantic pass — one structured JSON call per file (system prompt: `src/prompts/code-fact-extract.md`) returns a `module_summary` plus up to `KB_CODE_FACTS_MAX_PER_FILE` semantic facts. Each fact carries a `triplet` (subject/predicate/object) and an `anchor` that is either `module` or one of the symbols from the skeleton.
- Repair-friendly upsert — every row is stored with `source_kind = 'import_code'` and `source_ref = code:<path>@<anchor>#<contentHash>`. On re-run we group prior rows by `<path>@<anchor>` (via `SqliteKbIndexer.listActiveFactsBySourceRefPrefix`) and:
  - identical normalized text → no-op (dedupe in `upsertFact`),
  - reworded text → tombstone the old rows and insert the new one with `supersedes_fact_id`,
  - anchor missing in the new payload → tombstone all rows for that anchor.
- Scan — `kb scan` reads `code-facts-manifest.json` (per-base sidecar) and only re-extracts files whose `sha256` changed. The per-anchor diff guarantees that unchanged files don't churn the `facts` table.
Budget knobs (env, sane defaults): `KB_CODE_FACTS_MAX_FILES=40`, `KB_CODE_FACTS_PER_FILE_CHARS=6000`, `KB_CODE_FACTS_MAX_CONCURRENCY=4`, `KB_CODE_FACTS_MAX_PER_FILE=8`. The graph builder (`rebuildFactGraph`) consumes the new triples directly — there is no separate AST table.
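The per-anchor diff can be sketched as a pure planning function. Row shapes and names here are simplified stand-ins for the `SqliteKbIndexer` internals; only the three outcomes (no-op, supersede, tombstone) come from the description above:

```typescript
type FactRow = { id: number; anchor: string; text: string };
type Plan =
  | { op: "noop"; id: number }
  | { op: "supersede"; tombstone: number; text: string }
  | { op: "tombstone"; id: number };

// Assumed normalization: case- and whitespace-insensitive comparison.
const norm = (s: string) => s.toLowerCase().replace(/\s+/g, " ").trim();

function planAnchorDiff(prior: FactRow[], next: Map<string, string>): Plan[] {
  const plans: Plan[] = [];
  for (const row of prior) {
    const incoming = next.get(row.anchor);
    if (incoming === undefined) {
      plans.push({ op: "tombstone", id: row.id });  // anchor gone from new payload
    } else if (norm(incoming) === norm(row.text)) {
      plans.push({ op: "noop", id: row.id });       // identical normalized text
    } else {
      plans.push({ op: "supersede", tombstone: row.id, text: incoming }); // reworded
    }
  }
  return plans;
}
```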
## Code-graph cycle (cycle 7)

The code-graph cycle runs at the end of every `kb init` and `kb scan`. It performs deterministic AST indexing of every file in the repo — no LLM involved — and writes results into `kg_*` tables in the same `.kb-index.sqlite` database.
### Two indexers

| Indexer | Files | When it runs |
|---|---|---|
| `TsMorphIndexer` (`src/tools/code-graph-indexer.ts`) | `.ts`, `.tsx`, `.js`, `.jsx` | When `tsconfig.json` is found; uses the TypeScript compiler for type-aware analysis |
| `TreeSitterIndexer` (`src/tools/tree-sitter-indexer.ts`) | All other files | Always; uses WASM grammars via `web-tree-sitter` |
When both run (TS project with Go files), TsMorphIndexer runs first and produces richer TS data (type-aware import resolution, EXTENDS/IMPLEMENTS edges). TreeSitterIndexer then covers everything else and uses ON CONFLICT DO UPDATE so it never overwrites TsMorphIndexer’s richer TS nodes.
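The non-clobbering ordering can be modelled in memory. The "richness" guard below is an illustrative assumption about what the real `ON CONFLICT DO UPDATE` clause preserves, not the actual SQL:

```typescript
// TsMorphIndexer writes first with richer rows; TreeSitterIndexer writes
// second but must not clobber them. A guarded upsert models this.
type KgNode = { id: string; language: string; rich: boolean };

function upsertNode(table: Map<string, KgNode>, incoming: KgNode): void {
  const existing = table.get(incoming.id);
  if (existing && existing.rich && !incoming.rich) return; // keep the richer row
  table.set(incoming.id, incoming); // insert or update otherwise
}
```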
### File handling by extension

TreeSitterIndexer uses an explicit allowlist — not a denylist:

- AST-able (`.go`, `.ts`, `.tsx`, `.js`, `.jsx`, `.mjs`, `.cjs`) → full parse: file node + symbol nodes + import/export edges
- Text/config (`.md`, `.yaml`, `.json`, `.toml`, `.sql`, `.sh`, `.tf`, etc.) → file node only, `language='text'`, no symbols
- Everything else (images, binaries, lock files, compiled artifacts) → silently ignored
The supported Go, TypeScript, TSX, and JavaScript grammars ship as .wasm files in their respective tree-sitter-* npm packages — no native compilation needed, works on all platforms.
### What gets written

- `kg_nodes` — file nodes (`kind='file'`) and symbol nodes (`kind='symbol'`)
- `kg_edges` — `IMPORTS_FILE`, `EXPORTS_SYMBOL`, `EXTENDS`, `IMPLEMENTS` edges
- `kg_nodes_fts` — FTS index over node names and paths
- `kg_file_state` — per-file content hash for incremental skip on re-run
- `kg_semantic_bridge` — name-matched links between `kg_nodes` symbols and `kb_graph_entities`
### Semantic bridge

After indexing, both indexers populate `kg_semantic_bridge` by slugifying symbol names and matching them against existing `kb_graph_entities`. A match (confidence 0.8) creates a row linking the code node to the semantic entity. This is what allows `expandWithCodeNeighbors` in `CodeGraphStore` to answer "what source files are relevant to entity X?" without any LLM involvement — it follows the bridge, then traverses `IMPORTS_FILE` edges one hop out.
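A sketch of the bridge population, assuming a slug-keyed entity map. Only the slug-match idea and the 0.8 confidence come from the text; the slug rule and row shapes are illustrative:

```typescript
// Assumed slug rule: split camelCase, lowercase, hyphenate.
const slug = (s: string) =>
  s.replace(/([a-z0-9])([A-Z])/g, "$1-$2").toLowerCase().replace(/[^a-z0-9]+/g, "-");

function bridgeRows(
  symbols: Array<{ nodeId: string; name: string }>,
  entities: Map<string, string>, // entity slug -> entity id
): Array<{ nodeId: string; entityId: string; confidence: number }> {
  const rows: Array<{ nodeId: string; entityId: string; confidence: number }> = [];
  for (const sym of symbols) {
    const entityId = entities.get(slug(sym.name));
    if (entityId) rows.push({ nodeId: sym.nodeId, entityId, confidence: 0.8 });
  }
  return rows;
}
```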
### Incremental behaviour

Both indexers store a `content_hash` per file in `kg_file_state`. On re-run (including `kb scan`), files whose hash hasn't changed are counted as skipped and not re-processed. Only changed or new files are re-indexed.
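The skip check reduces to a few lines, assuming a plain map in place of the `kg_file_state` table and sha256 as the content hash:

```typescript
import { createHash } from "node:crypto";

const sha256 = (s: string) => createHash("sha256").update(s).digest("hex");

// Returns true when the file must be (re-)indexed; updates the stored hash.
function shouldReindex(state: Map<string, string>, path: string, content: string): boolean {
  const hash = sha256(content);
  if (state.get(path) === hash) return false; // unchanged → counted as skipped
  state.set(path, hash);
  return true;
}
```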
## Configuration Constants

| Constant | Value | Purpose |
|---|---|---|
| `MAX_SOURCE_SIZE` | 20 000 chars | Per-file cap for documentation files |
| `SOURCE_CODE_PER_FILE_CHARS` | 400 chars | Per-file snippet length for code crawl |
| `SOURCE_CODE_MAX_FILES` | 200 | Max source files indexed |
| `SOURCE_CODE_MAX_TOTAL_CHARS` | 60 000 chars | Total code index budget |
| `INIT_SOURCE_SHARD_MAX_FILES` | (see code) | Max shards when expanding |
| `INIT_SOURCE_SHARD_MAX_CHARS` | 8 000 chars | Per-shard content cap |