# KB Init Pipeline

`kb init` bootstraps a knowledge base from a repo. It runs, in order:

- **input collection** (README-like docs + optional source-code crawl)
- **`document-facts`**: deterministic sentence ingest from collected markdown into the `facts` table
- **`code-facts`**: per-file LLM extraction into `import_code` facts
- **`import-docs`**: one verbatim original SQLite doc per discovered markdown file
- **`write`**: persist docs; with **`kb scan`** this stage also plans/applies claim mutations
- **`pass-graph`**: optional graph extraction, when enabled
- **`code-graph`**: deterministic AST indexing into `kg_*` tables

Use **`kb scan`** to refresh sources against an existing base.

## Input Collection

```mermaid
flowchart TD
    A[kb init] --> B[collectSourceFiles]
    A --> C[crawlSourceCode]

    B --> B1["Fixed candidates\n(README, CLAUDE, AGENTS, …)"]
    B --> B2["Top-level *.md files\n(up to 8 total)"]
    B1 & B2 --> D["sourceFiles\nRecord<path, content>"]

    C --> C1["Walk repo tree\n(skip node_modules, dist, .git, …)"]
    C1 --> C2["Collect *.ts *.tsx *.js *.py *.go …\n(up to 200 files, 400 chars/file)"]
    C2 --> E["codeFiles\nRecord<path, snippet>"]

    D & E --> F[InitContext]
```

- `sourceFiles` — human-readable documentation files collected for **`import-docs`** (verbatim originals) and for **`document-facts`** / prompts.
- `codeFiles` — structural index of source code, fed into **`code-facts`** extraction.
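A minimal sketch of the two collectors, operating on an in-memory `Record<path, content>` instead of the real filesystem walk. The candidate list, skipped directories, and extension set are illustrative; the caps come from the configuration constants below:

```typescript
const MAX_SOURCE_SIZE = 20_000;          // per documentation file
const SOURCE_CODE_PER_FILE_CHARS = 400;  // per code snippet
const SOURCE_CODE_MAX_FILES = 200;

const DOC_CANDIDATES = ["README.md", "CLAUDE.md", "AGENTS.md"];
const CODE_EXTENSIONS = [".ts", ".tsx", ".js", ".py", ".go"];

function collectSourceFiles(repo: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  // Fixed candidates first, then remaining top-level *.md files, up to 8 total.
  const topLevelMd = Object.keys(repo)
    .filter((p) => !p.includes("/") && p.endsWith(".md"));
  for (const path of [...DOC_CANDIDATES, ...topLevelMd]) {
    if (Object.keys(out).length >= 8) break;
    const content = repo[path];
    if (content !== undefined && !(path in out)) {
      out[path] = content.slice(0, MAX_SOURCE_SIZE);
    }
  }
  return out;
}

function crawlSourceCode(repo: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const path of Object.keys(repo)) {
    if (Object.keys(out).length >= SOURCE_CODE_MAX_FILES) break;
    if (/(^|\/)(node_modules|dist|\.git)\//.test(path)) continue; // skipped dirs
    if (!CODE_EXTENSIONS.some((ext) => path.endsWith(ext))) continue;
    out[path] = repo[path].slice(0, SOURCE_CODE_PER_FILE_CHARS); // snippet only
  }
  return out;
}
```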

## Init cycles

```mermaid
flowchart TD
    A[kb init] --> R[read-inputs]
    R --> MF[document-facts]
    MF --> CF[code-facts]
    CF --> IM[import-docs]
    IM --> W[write]
    W --> PG[pass-graph]
    PG --> CG[code-graph]

    MF --> MF1["Deterministic markdown\n→ import_doc facts"]
    CF --> CF1["Per-file LLM\n→ import_code facts"]
    IM --> IM1["One original doc\nper source file"]
    W --> W1["SQLite upsert\n+ scan planner"]
    PG --> PG1["Optional graph\nextract to SQLite"]
    CG --> CG1["AST indexing\n→ kg_* tables + semantic bridge"]
```
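The cycle ordering above can be sketched as a plain sequential runner; this is an illustrative shape, not the real implementation, and the `Cycle` type and shared-context mutation are assumptions:

```typescript
type Cycle = { name: string; run: (ctx: Record<string, unknown>) => void };

function runInit(cycles: Cycle[], ctx: Record<string, unknown>): string[] {
  const completed: string[] = [];
  for (const cycle of cycles) {
    cycle.run(ctx); // each stage reads from and mutates the shared InitContext
    completed.push(cycle.name);
  }
  return completed;
}

const order = runInit(
  ["read-inputs", "document-facts", "code-facts", "import-docs", "write", "pass-graph", "code-graph"]
    .map((name) => ({ name, run: () => {} })),
  {},
);
```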

## Jekyll Routing

After `kb publish jekyll`, documents are routed by provenance:

| `is_original` | Collection | Directory |
|---|---|---|
| `1` | `original_docs` | `_original_docs/` |
| `0` | `autogenerated_docs` | `_autogenerated_docs/` |

Originals are **frozen snapshots** of source files (we do not rewrite or “correct” them in the KB pipeline). Autogenerated docs are the **curated layer** we refine for retrieval and demos; they may overlap originals on purpose.

README.md is excluded from both — it is the site homepage (`docs/index.md`).
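The routing table can be mirrored as a small helper; `jekyllTarget` and the `KbDoc` shape are hypothetical names for illustration:

```typescript
interface KbDoc {
  path: string;
  is_original: 0 | 1;
}

function jekyllTarget(doc: KbDoc): { collection: string; dir: string } | null {
  if (doc.path === "README.md") return null; // site homepage (docs/index.md)
  return doc.is_original === 1
    ? { collection: "original_docs", dir: "_original_docs/" }
    : { collection: "autogenerated_docs", dir: "_autogenerated_docs/" };
}
```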

## Title Conventions

| Context | Rule | Example |
|---|---|---|
| Autogenerated docs | Cap Every Word (`toTitleCase`) | `"Project Overview"` |
| Original/source docs | Basename as-is (`basenameTitle`) | `"CLAUDE.md"` |
| Jekyll filenames | Slugified basename, no extension | `claude-instructions.md` |
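Plausible re-implementations of the three conventions (the real `toTitleCase` and `basenameTitle` live in the kb codebase; these sketches only match the examples in the table):

```typescript
function toTitleCase(s: string): string {
  // Capitalize the first letter of every word.
  return s.replace(/\w\S*/g, (w) => w[0].toUpperCase() + w.slice(1));
}

function basenameTitle(path: string): string {
  return path.split("/").pop() ?? path; // basename as-is, extension kept
}

function jekyllSlug(path: string): string {
  // Slugified basename without extension; the Jekyll filename is slug + ".md".
  const base = (path.split("/").pop() ?? path).replace(/\.[^.]+$/, "");
  return base.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, "");
}
```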

## Facts extracted from written documents

When the init pipeline (or any path using **`SqliteDocumentWriter`**) persists markdown documents, the writer **indexes candidate facts** from document bodies (deterministic sentence segmentation, length filters, and capped inserts into the **`facts`** table). That is **incremental** fact growth alongside init; see **`facts-architecture.md`** §2 / §7 for the full ingest model.
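A minimal sketch of that ingest shape, assuming naive punctuation-based segmentation; the length thresholds and per-document cap here are invented stand-ins, not the real values:

```typescript
const MIN_LEN = 30;            // assumed lower length filter
const MAX_LEN = 300;           // assumed upper length filter
const MAX_FACTS_PER_DOC = 50;  // assumed insert cap

function candidateFacts(body: string): string[] {
  return body
    .split(/(?<=[.!?])\s+/)                       // naive sentence segmentation
    .map((s) => s.replace(/\s+/g, " ").trim())    // normalize whitespace
    .filter((s) => s.length >= MIN_LEN && s.length <= MAX_LEN)
    .slice(0, MAX_FACTS_PER_DOC);                 // capped inserts
}
```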

## Code-derived facts

The **`code-facts`** cycle ([src/core/code-fact-extract.ts](code-fact-extract.ts)) makes source code a **first-class fact source** without mirroring the AST:

1. **Skeleton** — a deterministic regex pass per file extracts top-level exports, imports, and the leading doc block. The skeleton is used only as **prompt context** and as the **anchor namespace**; it does not write fact rows.
2. **LLM semantic pass** — one structured JSON call per file (system prompt: [src/prompts/code-fact-extract.md](../prompts/code-fact-extract.md)) returns a `module_summary` plus up to **`KB_CODE_FACTS_MAX_PER_FILE`** semantic facts. Each fact carries a `triplet` (subject/predicate/object) and an `anchor` that is either `module` or one of the symbols from the skeleton.
3. **Repair-friendly upsert** — every row is stored with `source_kind = 'import_code'` and `source_ref = code:<path>@<anchor>#<contentHash>`. On re-run we group prior rows by `<path>@<anchor>` (via `SqliteKbIndexer.listActiveFactsBySourceRefPrefix`) and:
   - identical normalized text → no-op (dedupe in `upsertFact`),
   - reworded text → tombstone the old rows and insert the new one with `supersedes_fact_id`,
   - anchor missing in the new payload → tombstone all rows for that anchor.
4. **Scan** — `kb scan` reads `code-facts-manifest.json` (per-base sidecar) and only re-extracts files whose `sha256` changed. The per-anchor diff guarantees that unchanged files don't churn the `facts` table.

Budget knobs (env, sane defaults): `KB_CODE_FACTS_MAX_FILES=40`, `KB_CODE_FACTS_PER_FILE_CHARS=6000`, `KB_CODE_FACTS_MAX_CONCURRENCY=4`, `KB_CODE_FACTS_MAX_PER_FILE=8`. The graph builder (`rebuildFactGraph`) consumes the new triples directly — there is **no separate AST table**.
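The per-anchor diff decision in step 3 and the `source_ref` format can be sketched as follows; the `Action` names are modeled on the text above, not the real kb schema:

```typescript
type Action = "noop" | "insert" | "supersede" | "tombstone";

// prev/next are the prior and newly extracted fact text for one <path>@<anchor>.
function diffAnchor(prev: string | undefined, next: string | undefined): Action {
  const norm = (s: string) => s.replace(/\s+/g, " ").trim().toLowerCase();
  if (next === undefined) return "tombstone";  // anchor missing in new payload
  if (prev === undefined) return "insert";     // brand-new anchor
  return norm(prev) === norm(next)
    ? "noop"                                   // identical normalized text
    : "supersede";                             // reworded: tombstone + supersedes_fact_id
}

function sourceRef(path: string, anchor: string, contentHash: string): string {
  return `code:${path}@${anchor}#${contentHash}`;
}
```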

## Code-graph cycle (cycle 7)

The **`code-graph`** cycle runs at the end of every `kb init` and `kb scan`. It performs deterministic AST indexing of every file in the repo — no LLM involved — and writes results into `kg_*` tables in the same `.kb-index.sqlite` database.

### Two indexers

| Indexer | Files | When it runs |
|---|---|---|
| `TsMorphIndexer` (`src/tools/code-graph-indexer.ts`) | `.ts`, `.tsx`, `.js`, `.jsx` | When `tsconfig.json` is found; uses the TypeScript compiler for type-aware analysis |
| `TreeSitterIndexer` (`src/tools/tree-sitter-indexer.ts`) | All other files | Always; uses WASM grammars via `web-tree-sitter` |

When both run (e.g. a TS project that also contains Go files), `TsMorphIndexer` runs first and produces richer TS data (type-aware import resolution, `EXTENDS`/`IMPLEMENTS` edges). `TreeSitterIndexer` then covers everything else and uses `ON CONFLICT DO UPDATE` so it never overwrites `TsMorphIndexer`'s richer TS nodes.

### File handling by extension

`TreeSitterIndexer` uses an explicit **allowlist** — not a denylist:

- **AST-able** (`.go`, `.ts`, `.tsx`, `.js`, `.jsx`, `.mjs`, `.cjs`) → full parse: file node + symbol nodes + import/export edges
- **Text/config** (`.md`, `.yaml`, `.json`, `.toml`, `.sql`, `.sh`, `.tf`, etc.) → file node only, `language='text'`, no symbols
- **Everything else** (images, binaries, lock files, compiled artifacts) → silently ignored
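The allowlist above amounts to a three-way classifier; this is a hypothetical mirror of it, with the text/config set truncated to the extensions named in the list:

```typescript
const AST_EXTS = new Set([".go", ".ts", ".tsx", ".js", ".jsx", ".mjs", ".cjs"]);
const TEXT_EXTS = new Set([".md", ".yaml", ".json", ".toml", ".sql", ".sh", ".tf"]);

type FileClass = "ast" | "text" | "ignored";

function classify(path: string): FileClass {
  const ext = path.slice(path.lastIndexOf(".")).toLowerCase();
  if (AST_EXTS.has(ext)) return "ast";     // full parse: symbols + edges
  if (TEXT_EXTS.has(ext)) return "text";   // file node only, language='text'
  return "ignored";                         // images, binaries, lock files, …
}
```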

The supported Go, TypeScript, TSX, and JavaScript grammars ship as `.wasm` files in their respective `tree-sitter-*` npm packages — no native compilation needed, works on all platforms.

### What gets written

```sql
kg_nodes           — file nodes (kind='file') and symbol nodes (kind='symbol')
kg_edges           — IMPORTS_FILE, EXPORTS_SYMBOL, EXTENDS, IMPLEMENTS edges
kg_nodes_fts       — FTS index over node names and paths
kg_file_state      — per-file content hash for incremental skip on re-run
kg_semantic_bridge — name-matched links between kg_nodes symbols and kb_graph_entities
```

### Semantic bridge

After indexing, both indexers populate `kg_semantic_bridge` by slugifying symbol names and matching them against existing `kb_graph_entities`. A match (confidence 0.8) creates a row linking the code node to the semantic entity. This is what allows `expandWithCodeNeighbors` in `CodeGraphStore` to answer "what source files are relevant to entity X?" without any LLM involvement — it follows the bridge then traverses `IMPORTS_FILE` edges one hop out.
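The name matching can be sketched like this; the slug rules and data shapes are assumptions, with only the 0.8 confidence taken from the text:

```typescript
function slugifySymbol(name: string): string {
  return name
    .replace(/([a-z0-9])([A-Z])/g, "$1-$2") // split camelCase boundaries
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

interface Bridge {
  codeNodeId: string;
  entityId: string;
  confidence: number;
}

function bridgeMatches(
  symbols: Record<string, string>,     // code node id → symbol name
  entitySlugs: Record<string, string>, // slug → kb_graph_entities id
): Bridge[] {
  const rows: Bridge[] = [];
  for (const [nodeId, name] of Object.entries(symbols)) {
    const entityId = entitySlugs[slugifySymbol(name)];
    if (entityId) rows.push({ codeNodeId: nodeId, entityId, confidence: 0.8 });
  }
  return rows;
}
```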

### Incremental behaviour

Both indexers store a `content_hash` per file in `kg_file_state`. On re-run (including `kb scan`), files whose hash hasn't changed are counted as `skipped` and not re-processed. Only changed or new files are re-indexed.
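A minimal sketch of that check, using an in-memory map in place of the `kg_file_state` table (the real hash algorithm is an assumption):

```typescript
import { createHash } from "node:crypto";

// Returns true when the file must be (re)indexed; false counts as `skipped`.
function shouldReindex(
  state: Map<string, string>, // path → stored content_hash
  path: string,
  content: string,
): boolean {
  const hash = createHash("sha256").update(content).digest("hex");
  if (state.get(path) === hash) return false; // unchanged: skip
  state.set(path, hash);                      // changed or new: record and re-index
  return true;
}
```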

## Configuration Constants

| Constant | Value | Purpose |
|---|---|---|
| `MAX_SOURCE_SIZE` | 20 000 chars | Per-file cap for documentation files |
| `SOURCE_CODE_PER_FILE_CHARS` | 400 chars | Per-file snippet length for code crawl |
| `SOURCE_CODE_MAX_FILES` | 200 | Max source files indexed |
| `SOURCE_CODE_MAX_TOTAL_CHARS` | 60 000 chars | Total code index budget |
| `INIT_SOURCE_SHARD_MAX_FILES` | (see code) | Max shards when expanding |
| `INIT_SOURCE_SHARD_MAX_CHARS` | 8 000 chars | Per-shard content cap |
