# KB Init Pipeline
`kb init` bootstraps a knowledge base from a repo. It runs, in order:

- input collection — README-like docs plus an optional source-code crawl
- `document-facts` — deterministic sentence ingest from collected markdown into the `facts` table
- `code-facts` — per-file LLM extraction into `import_code` facts
- `import-docs` — one verbatim original SQLite doc per discovered markdown file
- `write` — persist docs; with `kb scan` this stage also plans/applies claim mutations
- `pass-graph` — runs only when enabled
- `code-graph` — deterministic AST indexing into `kg_*` tables

Use `kb scan` to refresh sources against an existing base.
## Input Collection
```mermaid
flowchart TD
    A[kb init] --> B[collectSourceFiles]
    A --> C[crawlSourceCode]
    B --> B1["Fixed candidates\n(README, CLAUDE, AGENTS, …)"]
    B --> B2["Top-level *.md files\n(up to 8 total)"]
    B1 & B2 --> D["sourceFiles\nRecord<path, content>"]
    C --> C1["Walk repo tree\n(skip node_modules, dist, .git, …)"]
    C1 --> C2["Collect *.ts *.tsx *.js *.py *.go …\n(up to 200 files, 400 chars/file)"]
    C2 --> E["codeFiles\nRecord<path, snippet>"]
    D & E --> F[InitContext]
```
`sourceFiles` — human-readable documentation files collected for `import-docs` (verbatim originals) and for `document-facts` prompts. `codeFiles` — structural index of source code, fed into `code-facts` extraction.
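The code-crawl budget can be sketched as follows. The constant values come from the Configuration Constants table below; the function name and exact cut-off behaviour are illustrative assumptions, not the real implementation:

```typescript
// Sketch of the crawl budget: cap file count, per-file snippet length,
// and total characters. Names are hypothetical stand-ins.
const SOURCE_CODE_PER_FILE_CHARS = 400;
const SOURCE_CODE_MAX_FILES = 200;
const SOURCE_CODE_MAX_TOTAL_CHARS = 60_000;

type CodeFiles = Record<string, string>;

function collectSnippets(files: Array<[path: string, content: string]>): CodeFiles {
  const out: CodeFiles = {};
  let total = 0;
  for (const [path, content] of files) {
    if (Object.keys(out).length >= SOURCE_CODE_MAX_FILES) break;
    const snippet = content.slice(0, SOURCE_CODE_PER_FILE_CHARS); // 400 chars/file
    if (total + snippet.length > SOURCE_CODE_MAX_TOTAL_CHARS) break;
    out[path] = snippet;
    total += snippet.length;
  }
  return out;
}
```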
## Init cycles
```mermaid
flowchart TD
    A[kb init] --> R[read-inputs]
    R --> MF[document-facts]
    MF --> CF[code-facts]
    CF --> IM[import-docs]
    IM --> W[write]
    W --> PG[pass-graph]
    PG --> CG[code-graph]
    MF --> MF1["Deterministic markdown\n→ facts import_doc"]
    CF --> CF1["Per-file LLM\n→ import_code facts"]
    IM --> IM1["One original doc\nper source file"]
    W --> W1["SQLite upsert\n+ scan planner"]
    PG --> PG1["Optional graph\nextract to SQLite"]
    CG --> CG1["AST indexing\n→ kg_* tables + semantic bridge"]
```
## Jekyll Routing

After `kb publish jekyll`, documents are routed by provenance:

| `is_original` | Collection | Directory |
|---|---|---|
| 1 | `original_docs` | `_original_docs/` |
| 0 | `autogenerated_docs` | `_autogenerated_docs/` |
Originals are frozen snapshots of source files (we do not rewrite or “correct” them in the KB pipeline). Autogenerated docs are the curated layer we refine for retrieval and demos; they may overlap originals on purpose.
README.md is excluded from both — it is the site homepage (docs/index.md).
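The routing rule plus the README exclusion can be sketched as a small helper; the function name and row shape here are hypothetical, only the directory names and the `is_original` flag come from the table above:

```typescript
// Illustrative routing: README.md is the site homepage and gets no collection.
function jekyllCollection(doc: { is_original: 0 | 1; path: string }): string | null {
  if (/(^|\/)README\.md$/i.test(doc.path)) return null; // homepage → docs/index.md
  return doc.is_original === 1 ? "_original_docs/" : "_autogenerated_docs/";
}
```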
## Title Conventions

| Context | Rule | Example |
|---|---|---|
| Autogenerated docs | Cap Every Word (`toTitleCase`) | "Project Overview" |
| Original/source docs | Basename as-is (`basenameTitle`) | "CLAUDE.md" |
| Jekyll filenames | Slugified basename, no extension | `claude-instructions.md` |
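A rough TypeScript rendering of the three rules. The real `toTitleCase` and `basenameTitle` helpers live in the KB codebase and may differ in detail; these are sketches:

```typescript
// Cap Every Word — autogenerated doc titles.
function toTitleCase(s: string): string {
  return s.replace(/\w\S*/g, (w) => w[0].toUpperCase() + w.slice(1));
}

// Basename as-is, extension included — original/source doc titles.
function basenameTitle(path: string): string {
  return path.split("/").pop() ?? path;
}

// Slugified basename, extension dropped — Jekyll filenames.
function slugifyBasename(path: string): string {
  const base = basenameTitle(path).replace(/\.[^.]+$/, "");
  return base.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-|-$/g, "");
}
```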
## Facts extracted from written documents
When the init pipeline (or any path using SqliteDocumentWriter) persists markdown documents, the writer indexes candidate facts from document bodies (deterministic sentence segmentation, length filters, and capped inserts into the facts table). That is incremental fact growth alongside init; see facts-architecture.md §2 / §7 for the full ingest model.
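A minimal sketch of that ingest step. The split rule, length bounds, and cap are illustrative assumptions; the real parameters are documented in facts-architecture.md:

```typescript
// Deterministic sentence segmentation → length filter → capped candidate list.
function extractCandidateFacts(
  body: string,
  opts = { minLen: 30, maxLen: 300, maxFacts: 50 }, // assumed bounds
): string[] {
  return body
    .split(/(?<=[.!?])\s+/)                    // naive sentence segmentation
    .map((s) => s.replace(/\s+/g, " ").trim()) // normalize whitespace
    .filter((s) => s.length >= opts.minLen && s.length <= opts.maxLen)
    .slice(0, opts.maxFacts);                  // capped inserts
}
```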
## Code-derived facts

The `code-facts` cycle (`src/core/code-fact-extract.ts`) makes source code a first-class fact source without mirroring the AST:
- Skeleton — a deterministic regex pass per file extracts top-level exports, imports, and the leading doc block. The skeleton is used only as prompt context and as the anchor namespace; it does not write fact rows.
- LLM semantic pass — one structured JSON call per file (system prompt: `src/prompts/code-fact-extract.md`) returns a `module_summary` plus up to `KB_CODE_FACTS_MAX_PER_FILE` semantic facts. Each fact carries a `triplet` (subject/predicate/object) and an `anchor` that is either `module` or one of the symbols from the skeleton.
- Repair-friendly upsert — every row is stored with `source_kind = 'import_code'` and `source_ref = code:<path>@<anchor>#<contentHash>`. On re-run we group prior rows by `<path>@<anchor>` (via `SqliteKbIndexer.listActiveFactsBySourceRefPrefix`) and:
  - identical normalized text → no-op (dedupe in `upsertFact`),
  - reworded text → tombstone the old rows and insert the new one with `supersedes_fact_id`,
  - anchor missing in the new payload → tombstone all rows for that anchor.
- Scan — `kb scan` reads `code-facts-manifest.json` (per-base sidecar) and only re-extracts files whose `sha256` changed. The per-anchor diff guarantees that unchanged files don't churn the `facts` table.
Budget knobs (env, sane defaults): `KB_CODE_FACTS_MAX_FILES=40`, `KB_CODE_FACTS_PER_FILE_CHARS=6000`, `KB_CODE_FACTS_MAX_CONCURRENCY=4`, `KB_CODE_FACTS_MAX_PER_FILE=8`. The graph builder (`rebuildFactGraph`) consumes the new triples directly — there is no separate AST table.
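The per-anchor diff can be sketched as a pure planning function. Row shapes and names here are simplified stand-ins for the `SqliteKbIndexer` internals; only the three outcomes (no-op, supersede, tombstone) come from the description above:

```typescript
type FactRow = { id: number; anchor: string; text: string };
type Plan =
  | { op: "noop"; id: number }
  | { op: "supersede"; tombstone: number; text: string }
  | { op: "tombstone"; id: number };

// Assumed normalization: case- and whitespace-insensitive comparison.
const norm = (s: string) => s.toLowerCase().replace(/\s+/g, " ").trim();

function planAnchorDiff(prior: FactRow[], next: Map<string, string>): Plan[] {
  const plans: Plan[] = [];
  for (const row of prior) {
    const incoming = next.get(row.anchor);
    if (incoming === undefined) {
      plans.push({ op: "tombstone", id: row.id });  // anchor gone from new payload
    } else if (norm(incoming) === norm(row.text)) {
      plans.push({ op: "noop", id: row.id });       // identical normalized text
    } else {
      plans.push({ op: "supersede", tombstone: row.id, text: incoming }); // reworded
    }
  }
  return plans;
}
```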
## Code-graph cycle (cycle 7)

The code-graph cycle runs at the end of every `kb init` and `kb scan`. It performs deterministic AST indexing of every file in the repo — no LLM involved — and writes results into `kg_*` tables in the same `.kb-index.sqlite` database.
### Two indexers

| Indexer | Files | When it runs |
|---|---|---|
| `TsMorphIndexer` (`src/tools/code-graph-indexer.ts`) | `.ts`, `.tsx`, `.js`, `.jsx` | When `tsconfig.json` is found; uses the TypeScript compiler for type-aware analysis |
| `TreeSitterIndexer` (`src/tools/tree-sitter-indexer.ts`) | All other files | Always; uses WASM grammars via `web-tree-sitter` |
When both run (TS project with Go files), TsMorphIndexer runs first and produces richer TS data (type-aware import resolution, EXTENDS/IMPLEMENTS edges). TreeSitterIndexer then covers everything else and uses ON CONFLICT DO UPDATE so it never overwrites TsMorphIndexer’s richer TS nodes.
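The non-clobbering ordering can be modelled in memory. The "richness" guard below is an illustrative assumption about what the real `ON CONFLICT DO UPDATE` clause preserves, not the actual SQL:

```typescript
// TsMorphIndexer writes first with richer rows; TreeSitterIndexer writes
// second but must not clobber them. A guarded upsert models this.
type KgNode = { id: string; language: string; rich: boolean };

function upsertNode(table: Map<string, KgNode>, incoming: KgNode): void {
  const existing = table.get(incoming.id);
  if (existing && existing.rich && !incoming.rich) return; // keep the richer row
  table.set(incoming.id, incoming); // insert or update otherwise
}
```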
### File handling by extension

TreeSitterIndexer uses an explicit allowlist — not a denylist:

- AST-able (`.go`, `.ts`, `.tsx`, `.js`, `.jsx`, `.mjs`, `.cjs`) → full parse: file node + symbol nodes + import/export edges
- Text/config (`.md`, `.yaml`, `.json`, `.toml`, `.sql`, `.sh`, `.tf`, etc.) → file node only, `language='text'`, no symbols
- Everything else (images, binaries, lock files, compiled artifacts) → silently ignored
The supported Go, TypeScript, TSX, and JavaScript grammars ship as .wasm files in their respective tree-sitter-* npm packages — no native compilation needed, works on all platforms.
### What gets written

- `kg_nodes` — file nodes (`kind='file'`) and symbol nodes (`kind='symbol'`)
- `kg_edges` — `IMPORTS_FILE`, `EXPORTS_SYMBOL`, `EXTENDS`, `IMPLEMENTS` edges
- `kg_nodes_fts` — FTS index over node names and paths
- `kg_file_state` — per-file content hash for incremental skip on re-run
- `kg_semantic_bridge` — name-matched links between `kg_nodes` symbols and `kb_graph_entities`
### Semantic bridge

After indexing, both indexers populate `kg_semantic_bridge` by slugifying symbol names and matching them against existing `kb_graph_entities`. A match (confidence 0.8) creates a row linking the code node to the semantic entity. This is what allows `expandWithCodeNeighbors` in `CodeGraphStore` to answer "what source files are relevant to entity X?" without any LLM involvement — it follows the bridge, then traverses `IMPORTS_FILE` edges one hop out.
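A sketch of the bridge population, assuming a slug-keyed entity map. Only the slug-match idea and the 0.8 confidence come from the text; the slug rule and row shapes are illustrative:

```typescript
// Assumed slug rule: split camelCase, lowercase, hyphenate.
const slug = (s: string) =>
  s.replace(/([a-z0-9])([A-Z])/g, "$1-$2").toLowerCase().replace(/[^a-z0-9]+/g, "-");

function bridgeRows(
  symbols: Array<{ nodeId: string; name: string }>,
  entities: Map<string, string>, // entity slug -> entity id
): Array<{ nodeId: string; entityId: string; confidence: number }> {
  const rows: Array<{ nodeId: string; entityId: string; confidence: number }> = [];
  for (const sym of symbols) {
    const entityId = entities.get(slug(sym.name));
    if (entityId) rows.push({ nodeId: sym.nodeId, entityId, confidence: 0.8 });
  }
  return rows;
}
```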
### Incremental behaviour

Both indexers store a `content_hash` per file in `kg_file_state`. On re-run (including `kb scan`), files whose hash hasn't changed are counted as skipped and not re-processed. Only changed or new files are re-indexed.
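The skip check reduces to a few lines, assuming a plain map in place of the `kg_file_state` table and sha256 as the content hash:

```typescript
import { createHash } from "node:crypto";

const sha256 = (s: string) => createHash("sha256").update(s).digest("hex");

// Returns true when the file must be (re-)indexed; updates the stored hash.
function shouldReindex(state: Map<string, string>, path: string, content: string): boolean {
  const hash = sha256(content);
  if (state.get(path) === hash) return false; // unchanged → counted as skipped
  state.set(path, hash);
  return true;
}
```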
## Configuration Constants

| Constant | Value | Purpose |
|---|---|---|
| `MAX_SOURCE_SIZE` | 20 000 chars | Per-file cap for documentation files |
| `SOURCE_CODE_PER_FILE_CHARS` | 400 chars | Per-file snippet length for code crawl |
| `SOURCE_CODE_MAX_FILES` | 200 | Max source files indexed |
| `SOURCE_CODE_MAX_TOTAL_CHARS` | 60 000 chars | Total code index budget |
| `INIT_SOURCE_SHARD_MAX_FILES` | (see code) | Max shards when expanding |
| `INIT_SOURCE_SHARD_MAX_CHARS` | 8 000 chars | Per-shard content cap |