mirror of https://github.com/Tencent/WeKnora.git synced 2026-06-04 13:30:32 +08:00

Files

Claude 55bc5dcbe2 fix(chunking): post-review polish — error surfacing + doc accuracy

Three small follow-ups from the QA review of 13f57ca:

1. KBChunkingDebug error surfacing
   Previously a 200 OK response with { success: false, error: "..." }
   would have been swallowed under a generic "unexpected response shape"
   message. The strict-shape check now distinguishes empty response,
   success=false (surfaces resp.error directly), and missing data — so
   any future backend-side validation message reaches the user.

2. Token approximation in docs/CHUNKING.md
   Child Chunk Size default 384 ≈ 95 EN tokens (was rounded to 80).
   384 / 4 chars-per-token = 96; "~95" is the honest figure.

3. API surface in docs/CHUNKING.md
   The example only documented PUT /initialization/config/:kbId
   (camelCase, documentSplitting envelope). Added explicit notes that
   POST/PUT /knowledge-bases use snake_case under chunking_config, and
   that POST /chunker/preview also uses the snake_case form plus a text
   field. Readers picking the wrong endpoint won't be surprised by the
   case mismatch anymore.

https://claude.ai/code/session_01XADhx6mtu2ZYW3DE9Lun6k

2026-05-06 17:17:07 +08:00

8.2 KiB

Raw Blame History

Chunking Guide

How WeKnora splits uploaded documents before embedding, why the defaults are what they are, and when to change them.

Why chunking matters

Retrieval-Augmented Generation (RAG) works by embedding small slices of your documents into a vector index, then pulling the most relevant slices back at query time. The way a document is sliced — chunk size, overlap, where the cuts fall — directly drives retrieval recall and the quality of the answers your LLM produces.

Empirically (Vecta Feb-2026 benchmark across 50 academic papers): recursive splitting at ~512 tokens with ~15% overlap is the strongest single-knob baseline at 69% end-to-end accuracy, beating semantic chunking and over-engineered hybrids. WeKnora uses that as the foundation and layers smarter strategies on top when the document gives us structural cues.

Adaptive 3-tier chunking

Set per knowledge base via the editor's Chunking sidebar (or the strategy field on the KB-config API).

Strategy	When picked	What it does
`auto` (recommended)	Default for new KBs	Profiles the document and picks the strongest tier from the chain below.
`heading`	Markdown-style structure	Splits at `#` / `##` / `###` boundaries. Each chunk gets a breadcrumb context header (`# Top > ## Section`) prepended at embedding time.
`heuristic`	PDF-style structure	Splits at form-feeds (page breaks), numbered sections, multilingual chapter markers (DE / EN / ZH), all-caps titles, and visual separators.
`legacy` (= `recursive`)	Anything else, or as fallback	Pure recursive separator-based splitter — newest version with priority recursion and overlap-cap fixes.

A document profiler runs first and counts structural signals (Markdown headings, form-feeds, chapter markers per language, all-caps lines, visual separators, blank-line bursts). Auto-strategy picks the tier chain based on those counts; a validator rejects obviously broken output (e.g. the heading splitter producing 200 single-line chunks) and falls through to the next tier.

Settings reference

Core

Setting	Range	Default	Sweet spot for…
Chunk size	100–4000 chars	512	Default works for most cases. 200–400 for FAQs / atomic Q&A. 1000–2000 for narrative documents.
Chunk overlap	0–500 chars	80 (~15%)	0 for FAQs and structured records. 80 (default) for general documents. 150–200 for argumentative texts where reasoning crosses chunks.
Separators	string list	`["\n\n", "\n", "。", "！", "？", ";", "；"]`	Order matters — splitter tries higher-priority separators first and only falls back to lower ones when a piece is still oversize.

Parent-Child chunking

Two-level retrieval: child chunks (small, embedded for vector match) and parent chunks (larger, returned to the LLM for context).

Setting	Range	Default	Notes
Enable parent-child	toggle	on	Recommended for documents > 10 pages. Skip for short FAQs to halve storage cost.
Parent chunk size	512–8192 chars	4096 (~1000 EN tokens)	Larger for long-context LLMs (Claude, GPT-4-Turbo). Smaller (1024–2048) for local LLMs with 4k contexts.
Child chunk size	64–2048 chars	384 (~95 EN tokens)	128–256 for Q&A-style precise matching. 512–1024 if your embedder accepts >1000 tokens (E5 / BGE-large).

Advanced

Setting	Range	Default	When to set
Token limit	0–8192	0 (off)	Activate when your embedding model has a small token cap. See table below.
Languages	`de` / `en` / `zh` (multi-select)	empty (auto-detect)	Set explicitly for homogeneous corpora to narrow heuristic patterns.

Token-limit guide per embedding model

Embedder	Token limit	Recommended `tokenLimit` setting
OpenAI `text-embedding-3-small/large`	8191	0 (leave off)
Anthropic Voyage-3	32000	0
Jina-embeddings-v3	8192	0
Cohere `embed-multilingual-v3`	512	400
BGE-base / BGE-large / E5-large	512	400
Sentence-Transformer `all-MiniLM-L6-v2`	256	200

Rule of thumb: leave at 0 for any modern embedder with > 2000 tokens. Activate to 80% of the model's hard limit for smaller embedders so chunks always fit even for CJK content (which is denser per character).

Use-case presets

Workload	Strategy	ChunkSize	Overlap	Parent-Child
FAQ / Q&A knowledge base	`auto` (likely picks legacy)	200–400	0	off
Markdown documentation / wikis	`auto` (picks heading)	512	80	on
PDF reports with page breaks	`auto` (picks heuristic)	800–1200	100–150	on
Long-form narrative (books, articles)	`auto` (picks recursive)	1000–2000	150–200	on
Code documentation	`legacy`	800	100	optional
Mixed-language corpus	`auto`, languages = empty	512	80	on
Tabular reports / CSV-derived	`legacy`	400	0	off

Debugging in the UI

The KB editor's Chunking sidebar has a "Test with sample text" collapsible at the bottom:

Paste a Markdown / plain-text snippet (max 64 KB).
Click Run preview.
The panel shows:
- Selected strategy tier as a colored tag
- Tiers that were rejected and why (e.g. "too many tiny chunks")
- Document profile (heading counts, form-feeds, chapter markers, detected languages)
- Size statistics over the full chunk set (avg / min / max / stddev)
- Per-chunk cards with size in chars + approximate tokens, position range, the section breadcrumb (when set), and a content preview

This runs read-only against a goroutine-isolated splitter pass (5s timeout) — no DB writes, no embedding API calls. Use it to compare configurations against the same sample before triggering a re-upload.

API

Three endpoints can write the chunking config. The KB-config update endpoint is the one wired to the editor UI and uses camelCase with a documentSplitting envelope:

PUT /api/v1/initialization/config/:kbId
Authorization: Bearer <jwt>
Content-Type: application/json

{
  "documentSplitting": {
    "chunkSize": 512,
    "chunkOverlap": 80,
    "separators": ["\n\n", "\n", "。", "！", "？", ";", "；"],
    "strategy": "auto",
    "tokenLimit": 0,
    "languages": ["de", "en"],
    "enableParentChild": true,
    "parentChunkSize": 4096,
    "childChunkSize": 384
  }
}

The strategy, tokenLimit, and languages fields use pointer-based DTOs server-side: omitting them in the payload means "no change", sending an empty string / 0 / [] explicitly resets to default.

The KB CRUD endpoints (POST /api/v1/knowledge-bases, PUT /api/v1/knowledge-bases/:id) take the same fields but in snake_case under a chunking_config envelope: { "chunking_config": { "chunk_size": 512, "chunk_overlap": 80, "strategy": "auto", "token_limit": 0, "languages": ["de", "en"], ... } }.

The preview endpoint at POST /api/v1/chunker/preview uses the snake_case form too and additionally takes a text field for the sample to chunk.

Known trade-offs

Tier-1 heading-aware chunking prepends the section breadcrumb to the embedding input, costing ~5% more tokens per chunk in exchange for ~30–50% fewer chunks on structured documents (net token savings on storage and at query time).
Strategy switches do not auto-reindex existing documents. After changing a KB's strategy, re-upload affected files (or trigger re-indexing via the UI) to apply the new chunking.
OCR artifacts in PDFs (vertical layout text broken character- by-character into separate lines) cannot be fixed by any splitter — this is a parser-side limitation. The heuristic tier still keeps chunks aligned to page boundaries, which mitigates the worst cases.
The recursive strategy value exists in the API for completeness but is intentionally hidden from the UI: it is functionally near legacy and adding a fifth dropdown option dilutes the meaningful choice between automatic / Markdown / heuristic / legacy.

8.2 KiB Raw Blame History Unescape Escape