docs(chunking): align overlap default + sharpen UI/source/repo docs

Three documentation passes around the adaptive chunking work:

UI
- Frontend ChunkOverlap default consolidated to 80 (was 100), matching chunker.DefaultChunkOverlap on the backend. Both DEFAULT_CHUNKING_PRESET and initFormData updated. The KB-load fallback also uses 80 when a loaded KB has no chunk_overlap stored.
- All four locales (en-US, zh-CN, ko-KR, ru-RU) get rewritten chunking setting descriptions: each now states the validated range, the default, and the situations where you'd deviate (FAQ vs narrative, embedder token limits, language-specific corpora).

Source code
- splitter.go: DefaultChunkSize / DefaultChunkOverlap constants get a longer block comment explaining the per-language token math and the use-case sweet spots, plus a migration note on what the old inconsistent defaults were.
- KBChunkingSettings.vue: new comment block above ChunkingConfig documents the slider min/max for each setting, why those bounds exist, and the recommended TokenLimit values per embedding model.

Repo docs
- New docs/CHUNKING.md: end-to-end guide covering why chunking matters, the adaptive 3-tier architecture, per-setting reference with ranges and sweet spots, parent-child explanation, the token-limit table per embedder (OpenAI / Voyage / Cohere / BGE / MiniLM / Jina), 7 use-case presets, the debug panel workflow, the API surface, and known trade-offs (recursive strategy hidden from UI, no auto-reindex on strategy switch, OCR limitations).
- CHANGELOG.md gets a new [Unreleased] section consolidating all the adaptive-chunking work shipped on this branch: 5 features, 9 improvements, 6 fixes, 1 docs entry. The entry references docs/CHUNKING.md for deeper explanation.

https://claude.ai/code/session_01XADhx6mtu2ZYW3DE9Lun6k
2026-06-04 13:30:32 +08:00 · 2026-05-03 11:57:03 +00:00
parent 8fd06b1e07
commit 13f57caa3e
9 changed files with 277 additions and 52 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,6 +2,37 @@
 All notable changes to this project will be documented in this file.
 ## [Unreleased]
 ### 🚀 New Features
 - **NEW**: Adaptive 3-tier chunking — documents are now profiled before splitting and routed to one of three strategies: heading-aware (Markdown structure), heuristic (form-feeds, multilingual chapter markers DE/EN/ZH, all-caps titles, visual separators), or recursive (the modernized legacy splitter as a fallback). Auto-strategy is the new default for fresh KBs; existing KBs keep their previous behavior until the user opts in. See `docs/CHUNKING.md`.
 - **NEW**: KB editor — chunking settings panel surfaces the new strategy selector (Automatic / Markdown-optimized / Smart structure detection / Classic) plus advanced options for token limit per chunk and language hints. Sharper inline help text on every setting explains when defaults apply and when to tune.
 - **NEW**: Chunking debug panel — embedded "Test with sample text" panel under the chunking settings. Paste a snippet, hit Run preview, see selected tier, rejected tiers + reasons, document profile, size distribution stats over the full chunk set, and per-chunk cards with breadcrumb + content preview. Read-only, no DB or embedding side effects, 5-second server-side timeout.
 - **NEW**: `POST /api/v1/chunker/preview` endpoint backing the debug panel. Returns `selected_tier`, `tier_chain`, `rejected[]`, `profile`, `chunks[]`, and `stats`. Capped at 64k input runes / 500 chunks per response.
 - **NEW**: Per-tenant RRF (Reciprocal Rank Fusion) tuning — `RRFK`, `RRFVectorWeight`, `RRFKeywordWeight` are now configurable on the tenant `RetrievalConfig`. Defaults preserve the previous hardcoded behavior (k=60, weights 0.7/0.3).
 ### ⚡ Improvements
 - **IMPROVED**: Chunker recursive priority — `splitBySeparators` now genuinely walks separators by priority and recursively re-splits oversize sub-pieces with the next-priority separator. Mirrors the Python reference. Without this fix, a "one paragraph break followed by a long run of newline-separated lines" pattern could emit ~1900-rune chunks at chunkSize=300.
 - **IMPROVED**: ChunkOverlap default consolidated to 80 (~15% of ChunkSize). Previously the Go DefaultConfig used 64, the knowledge service used 50, the Python docreader used 100, and the frontend form initialised to 100. All paths now align.
 - **IMPROVED**: ContextHeader (Markdown breadcrumb) lives on `Chunk.ContextHeader`, separate from `Chunk.Content`. Restores the `End-Start == len(Content)` invariant that the document-reconstruction path in `knowledge.go` relies on for summary generation and UI highlighting. Eliminates a duplicate-heading regression where the section heading appeared twice in a chunk's body.
 - **IMPROVED**: Embedding pipeline — exponential backoff (200/400/800/1600/3200 ms) replaces the previous fixed 100ms × 5 retry loop, with context-cancellation between attempts. `sanitizeForEmbedding` caps single embedding inputs at 20k runes with a warning log on overflow.
 - **IMPROVED**: SplitParentChild forces children onto the recursive tier, skipping per-parent profile passes (previously paid N extra O(N) document scans).
 - **IMPROVED**: Heuristic splitter snaps overlap start to the nearest semantic boundary or newline instead of slicing mid-line / mid-word.
 - **IMPROVED**: Validator flow — when every tier is rejected, the chain returns the legacy tier's output directly instead of running SplitText a second time.
 - **IMPROVED**: Token limit per chunk — when set, ChunkSize is auto-clamped to a per-language character budget (with a 10% safety factor). Prevents overshooting embedding model token caps on CJK content where 1 char ≈ 0.6 tokens.
 - **IMPROVED**: KB-config API — `strategy`, `tokenLimit`, `languages` use pointer DTOs server-side so a payload omitting a field means "no change" while an explicit empty / 0 / [] resets to default. Previously these were write-once fields.
 ### 🐛 Bug Fixes
 - **FIXED**: Chunker — `Chunk.Start` / `End` rune-offset invariant restored after the heading-aware splitter started prepending breadcrumbs to content (regression introduced during the initial Tier-1 work, fixed before any release).
 - **FIXED**: Heuristic splitter — `applyOverlap` aligns to boundaries instead of doing blind char-subtraction that could leave chunks starting mid-word in CJK text.
 - **FIXED**: Preview endpoint — chunk-size statistics are now computed over the FULL chunk set before truncating the response payload to 500 entries. Previously `avg`/`min`/`max`/`stddev` reflected only the first 500 chunks of a larger split.
 - **FIXED**: Preview endpoint — empty / whitespace-only sample text now returns a friendly 400 ("paste a sample…") instead of gin's cryptic `Field validation failed` error.
 - **FIXED**: Frontend chunking debug panel — added explicit `type="button"`, prominent loading and error states, and console error logging so failed previews are debuggable from DevTools without enabling verbose logging. Earlier the panel could appear to "vanish" silently when a request failed.
 - **FIXED**: KB-editor i18n — `chunkOverlap` initial form value aligned with the backend default (80, not 100); description texts on every chunking setting now state the recommended ranges per use-case.
 ### 📚 Documentation
 - **DOC**: New `docs/CHUNKING.md` — strategy explanations, settings reference with use-case presets, token-limit guide per embedding model, debugging workflow, and known trade-offs.
 ## [0.5.1] - 2026-04-30
 ### 🚀 New Features
--- a/docs/CHUNKING.md
+++ b/docs/CHUNKING.md
@@ -0,0 +1,160 @@
 # Chunking Guide
 How WeKnora splits uploaded documents before embedding, why the defaults
 are what they are, and when to change them.
 ## Why chunking matters
 Retrieval-Augmented Generation (RAG) works by embedding small slices of
 your documents into a vector index, then pulling the most relevant
 slices back at query time. The way a document is sliced — chunk size,
 overlap, where the cuts fall — directly drives retrieval recall and the
 quality of the answers your LLM produces.
 Empirically (Vecta Feb-2026 benchmark across 50 academic papers):
 recursive splitting at ~512 tokens with ~15% overlap is the strongest
 single-knob baseline at 69% end-to-end accuracy, beating semantic
 chunking and over-engineered hybrids. WeKnora uses that as the
 foundation and layers smarter strategies on top when the document gives
 us structural cues.
 ## Adaptive 3-tier chunking
 Set per knowledge base via the editor's **Chunking** sidebar (or the
 `strategy` field on the KB-config API).
 | Strategy | When picked | What it does |
 |----------|-------------|--------------|
 | `auto` (recommended) | Default for new KBs | Profiles the document and picks the strongest tier from the chain below. |
 | `heading` | Markdown-style structure | Splits at `#` / `##` / `###` boundaries. Each chunk gets a breadcrumb context header (`# Top > ## Section`) prepended at embedding time. |
 | `heuristic` | PDF-style structure | Splits at form-feeds (page breaks), numbered sections, multilingual chapter markers (DE / EN / ZH), all-caps titles, and visual separators. |
 | `legacy` (= `recursive`) | Anything else, or as fallback | Pure recursive separator-based splitter — newest version with priority recursion and overlap-cap fixes. |
 A document profiler runs first and counts structural signals (Markdown
 headings, form-feeds, chapter markers per language, all-caps lines,
 visual separators, blank-line bursts). Auto-strategy picks the tier
 chain based on those counts; a validator rejects obviously broken
 output (e.g. the heading splitter producing 200 single-line chunks)
 and falls through to the next tier.
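The fall-through behaviour can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not WeKnora's actual code: the `tier` type, `runChain`, the two toy splitters, and the validation threshold (average chunk shorter than 20 runes) are hypothetical names and values picked for the example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// tier is a hypothetical stand-in for one splitting strategy.
type tier struct {
	name  string
	split func(text string) []string
}

// validate rejects obviously broken output. The "average chunk under
// 20 runes" rule is an assumed threshold, chosen only for this example.
func validate(chunks []string) (bool, string) {
	if len(chunks) == 0 {
		return false, "no chunks produced"
	}
	total := 0
	for _, c := range chunks {
		total += utf8.RuneCountInString(c)
	}
	if total/len(chunks) < 20 {
		return false, "too many tiny chunks"
	}
	return true, ""
}

// runChain tries each tier in order and returns the first output that
// passes validation, together with the reasons earlier tiers were
// rejected. If every tier is rejected, the last tier's output is
// returned as-is rather than being computed a second time.
func runChain(text string, chain []tier) (string, []string, map[string]string) {
	rejected := map[string]string{}
	selected := ""
	var chunks []string
	for _, t := range chain {
		selected = t.name
		chunks = t.split(text)
		ok, reason := validate(chunks)
		if ok {
			break
		}
		rejected[t.name] = reason
	}
	return selected, chunks, rejected
}

// splitOnHeadings starts a new chunk at every line beginning with "#".
func splitOnHeadings(text string) []string {
	var chunks, cur []string
	for _, line := range strings.Split(text, "\n") {
		if strings.HasPrefix(line, "#") && len(cur) > 0 {
			chunks = append(chunks, strings.Join(cur, "\n"))
			cur = nil
		}
		cur = append(cur, line)
	}
	if len(cur) > 0 {
		chunks = append(chunks, strings.Join(cur, "\n"))
	}
	return chunks
}

// splitOnParagraphs is a toy fallback that cuts at blank lines only.
func splitOnParagraphs(text string) []string {
	return strings.Split(text, "\n\n")
}

func main() {
	chain := []tier{
		{name: "heading", split: splitOnHeadings},
		{name: "legacy", split: splitOnParagraphs},
	}
	// Headings with one-character bodies: the heading tier fragments badly.
	name, chunks, rejected := runChain("# A\nx\n# B\ny\n# C\nz\n# D\nw", chain)
	fmt.Println(name, len(chunks), rejected)
}
```

The rejected map plays the role of the `rejected[]` list that the preview endpoint reports, so a caller can see why a stronger tier was skipped.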
 ## Settings reference
 ### Core
 | Setting | Range | Default | Sweet spot for… |
 |---------|-------|---------|-----------------|
 | **Chunk size** | 100–4000 chars | 512 | Default works for most cases. 200–400 for FAQs / atomic Q&A. 1000–2000 for narrative documents. |
 | **Chunk overlap** | 0–500 chars | 80 (~15%) | 0 for FAQs and structured records. 80 (default) for general documents. 150–200 for argumentative texts where reasoning crosses chunks. |
 | **Separators** | string list | `["\n\n", "\n", "。", "！", "？", ";", "；"]` | Order matters — splitter tries higher-priority separators first and only falls back to lower ones when a piece is still oversize. |
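The separator priority rule can be illustrated with a short Go sketch. It is an assumption-laden simplification rather than the real `splitBySeparators`: the function names are hypothetical, overlap is left out entirely, and separators stay attached to the piece that precedes them.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// splitByPriority cuts text into pieces of at most size runes. It tries
// separators in order and only descends to the next one for pieces that
// are still oversize.
func splitByPriority(text string, seps []string, size int) []string {
	if size < 1 {
		size = 1
	}
	if utf8.RuneCountInString(text) <= size {
		return []string{text}
	}
	if len(seps) == 0 {
		return hardCut(text, size) // no separator left: cut at the rune limit
	}
	var pieces []string
	for _, part := range strings.SplitAfter(text, seps[0]) {
		if part == "" {
			continue
		}
		// Oversize parts are re-split with the next-priority separator.
		pieces = append(pieces, splitByPriority(part, seps[1:], size)...)
	}
	return mergeSmall(pieces, size)
}

// mergeSmall greedily joins adjacent pieces while the result fits in size.
func mergeSmall(pieces []string, size int) []string {
	var out []string
	cur := ""
	for _, p := range pieces {
		if cur != "" && utf8.RuneCountInString(cur)+utf8.RuneCountInString(p) > size {
			out = append(out, cur)
			cur = ""
		}
		cur += p
	}
	if cur != "" {
		out = append(out, cur)
	}
	return out
}

// hardCut slices text every size runes, ignoring any structure.
func hardCut(text string, size int) []string {
	runes := []rune(text)
	var out []string
	for len(runes) > size {
		out = append(out, string(runes[:size]))
		runes = runes[size:]
	}
	return append(out, string(runes))
}

func main() {
	// One paragraph break followed by a run of newline-separated lines.
	text := "intro\n\n" + strings.Repeat("line of text\n", 10)
	for _, c := range splitByPriority(text, []string{"\n\n", "\n"}, 30) {
		fmt.Printf("%d runes: %q\n", utf8.RuneCountInString(c), c)
	}
}
```

Because the oversize second part is re-split on `"\n"` instead of being emitted whole, no chunk exceeds the requested size, and concatenating the chunks reproduces the input.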
 ### Parent-Child chunking
 Two-level retrieval: **child chunks** (small, embedded for vector match)
 and **parent chunks** (larger, returned to the LLM for context).
 | Setting | Range | Default | Notes |
 |---------|-------|---------|-------|
 | **Enable parent-child** | toggle | on | Recommended for documents > 10 pages. Skip for short FAQs to halve storage cost. |
 | **Parent chunk size** | 512–8192 chars | 4096 (~1000 EN tokens) | Larger for long-context LLMs (Claude, GPT-4-Turbo). Smaller (1024–2048) for local LLMs with 4k contexts. |
 | **Child chunk size** | 64–2048 chars | 384 (~80 EN tokens) | 128–256 for Q&A-style precise matching. 512–1024 if your embedder has a 512-token window (E5 / BGE-large) and the content is mostly English. |
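A minimal Go sketch of the two-level idea follows. It assumes fixed-width cuts for brevity (the real splitter respects separators), and `child`, `buildParentChild`, and `parentsForHits` are hypothetical names introduced only for illustration.

```go
package main

import "fmt"

// child is an embedded slice that remembers which parent it came from.
type child struct {
	text     string
	parentID int
}

// cut slices text into consecutive pieces of at most size runes.
func cut(text string, size int) []string {
	if size < 1 {
		size = 1
	}
	runes := []rune(text)
	var out []string
	for len(runes) > 0 {
		n := size
		if len(runes) < n {
			n = len(runes)
		}
		out = append(out, string(runes[:n]))
		runes = runes[n:]
	}
	return out
}

// buildParentChild cuts text into parents, then each parent into children.
func buildParentChild(text string, parentSize, childSize int) ([]string, []child) {
	parents := cut(text, parentSize)
	var children []child
	for id, p := range parents {
		for _, c := range cut(p, childSize) {
			children = append(children, child{text: c, parentID: id})
		}
	}
	return parents, children
}

// parentsForHits maps matched child indexes to their parents, returning
// each parent once, in the order it was first hit.
func parentsForHits(parents []string, children []child, hits []int) []string {
	seen := map[int]bool{}
	var out []string
	for _, h := range hits {
		id := children[h].parentID
		if !seen[id] {
			seen[id] = true
			out = append(out, parents[id])
		}
	}
	return out
}

func main() {
	parents, children := buildParentChild("abcdefghijklmnopqrst", 10, 4)
	// Pretend vector search matched children 1, 2 and 4.
	fmt.Println(parentsForHits(parents, children, []int{1, 2, 4}))
}
```

Only the children would be embedded; the parents are stored alongside them, which is where the extra storage cost for short FAQs comes from.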
 ### Advanced
 | Setting | Range | Default | When to set |
 |---------|-------|---------|-------------|
 | **Token limit** | 0–8192 | 0 (off) | Activate when your embedding model has a small token cap. See table below. |
 | **Languages** | `de` / `en` / `zh` (multi-select) | empty (auto-detect) | Set explicitly for homogeneous corpora to narrow heuristic patterns. |
 #### Token-limit guide per embedding model
 | Embedder | Token limit | Recommended `tokenLimit` setting |
 |----------|-------------|----------------------------------|
 | OpenAI `text-embedding-3-small/large` | 8191 | **0 (leave off)** |
 | Voyage AI `voyage-3` | 32000 | **0** |
 | Jina-embeddings-v3 | 8192 | **0** |
 | Cohere `embed-multilingual-v3` | 512 | **400** |
 | BGE-base / BGE-large / E5-large | 512 | **400** |
 | Sentence-Transformer `all-MiniLM-L6-v2` | 256 | **200** |
 Rule of thumb: leave at 0 for any modern embedder that accepts more than
 2000 tokens. For smaller embedders, set it to roughly 80% of the model's
 hard limit so chunks always fit, even for CJK content (which uses more
 tokens per character).
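The effect of a token limit on the character-based chunk size can be worked through in a few lines of Go. The sketch rests on assumptions: the helper names are hypothetical, the 10% safety margin is applied by multiplying the limit by 0.9, the CJK density follows the "1 char ≈ 0.6 tokens" figure, and the English density of 0.25 tokens per character is derived from "512 chars ≈ 100–130 tokens". The real splitter may compute its budget differently.

```go
package main

import "fmt"

// Assumed token densities, expressed as tokens per 1000 characters so
// the arithmetic stays in integers.
const (
	cjkTokensPer1000Chars     = 600 // 1 char ≈ 0.6 tokens
	englishTokensPer1000Chars = 250 // assumption: about 4 chars per token
)

// charBudget converts a token limit into a character budget, keeping a
// 10% safety margin below the limit.
func charBudget(tokenLimit, tokensPer1000Chars int) int {
	return tokenLimit * 900 / tokensPer1000Chars
}

// clampChunkSize returns chunkSize unchanged when tokenLimit is 0 (off),
// otherwise the smaller of chunkSize and the character budget.
func clampChunkSize(chunkSize, tokenLimit, tokensPer1000Chars int) int {
	if tokenLimit <= 0 {
		return chunkSize
	}
	if b := charBudget(tokenLimit, tokensPer1000Chars); b < chunkSize {
		return b
	}
	return chunkSize
}

func main() {
	// MiniLM-style setup: tokenLimit 200 with the default chunk size of 512.
	fmt.Println("CJK:", clampChunkSize(512, 200, cjkTokensPer1000Chars))
	fmt.Println("English:", clampChunkSize(512, 200, englishTokensPer1000Chars))
}
```

Under these assumed ratios, a token limit of 200 cuts the default 512-character chunk to 300 characters for CJK text and leaves English text untouched.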
 ## Use-case presets
 | Workload | Strategy | ChunkSize | Overlap | Parent-Child |
 |----------|----------|-----------|---------|--------------|
 | FAQ / Q&A knowledge base | `auto` (likely picks legacy) | 200–400 | 0 | off |
 | Markdown documentation / wikis | `auto` (picks heading) | 512 | 80 | on |
 | PDF reports with page breaks | `auto` (picks heuristic) | 800–1200 | 100–150 | on |
 | Long-form narrative (books, articles) | `auto` (picks legacy) | 1000–2000 | 150–200 | on |
 | Code documentation | `legacy` | 800 | 100 | optional |
 | Mixed-language corpus | `auto`, languages = empty | 512 | 80 | on |
 | Tabular reports / CSV-derived | `legacy` | 400 | 0 | off |
 ## Debugging in the UI
 The KB editor's **Chunking** sidebar has a "Test with sample text"
 collapsible at the bottom:
 1. Paste a Markdown / plain-text snippet (max 64k runes, i.e. Unicode characters).
 2. Click **Run preview**.
 3. The panel shows:
   - Selected strategy tier as a colored tag
   - Tiers that were rejected and why (e.g. "too many tiny chunks")
   - Document profile (heading counts, form-feeds, chapter markers,
     detected languages)
   - Size statistics over the full chunk set (avg / min / max / stddev)
   - Per-chunk cards with size in chars + approximate tokens, position
     range, the section breadcrumb (when set), and a content preview
 This runs read-only against a goroutine-isolated splitter pass (5s
 timeout) — no DB writes, no embedding API calls. Use it to compare
 configurations against the same sample before triggering a re-upload.
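The "statistics over the full chunk set" behaviour can be sketched as below. This is an illustration only: `sizeStats` and `buildPreview` are hypothetical names, sizes are counted in runes, and the use of the population standard deviation is an assumption about what the endpoint reports.

```go
package main

import (
	"fmt"
	"math"
	"unicode/utf8"
)

// sizeStats summarises chunk sizes in runes.
type sizeStats struct {
	Count  int
	Avg    float64
	Min    float64
	Max    float64
	StdDev float64
}

// buildPreview computes statistics over ALL chunks first and only then
// truncates the list that is sent back to the caller.
func buildPreview(chunks []string, maxChunks int) (sizeStats, []string) {
	var s sizeStats
	s.Count = len(chunks)
	if s.Count == 0 {
		return s, nil
	}
	sizes := make([]float64, len(chunks))
	sum := 0.0
	s.Min, s.Max = math.Inf(1), math.Inf(-1)
	for i, c := range chunks {
		n := float64(utf8.RuneCountInString(c))
		sizes[i] = n
		sum += n
		s.Min = math.Min(s.Min, n)
		s.Max = math.Max(s.Max, n)
	}
	s.Avg = sum / float64(s.Count)
	sq := 0.0
	for _, n := range sizes {
		sq += (n - s.Avg) * (n - s.Avg)
	}
	s.StdDev = math.Sqrt(sq / float64(s.Count)) // population std. deviation
	if maxChunks >= 0 && len(chunks) > maxChunks {
		chunks = chunks[:maxChunks]
	}
	return s, chunks
}

func main() {
	chunks := []string{"ab", "abcd", "abcd", "abcd", "abcde", "abcde", "abcdefg", "abcdefghi"}
	stats, shown := buildPreview(chunks, 3)
	fmt.Printf("%+v shown=%d\n", stats, len(shown))
}
```

Truncating after the statistics are computed keeps the reported figures honest even when only the first few chunk cards are displayed.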
 ## API
 ```http
 PUT /api/v1/initialization/config/:kbId
 Authorization: Bearer <jwt>
 Content-Type: application/json
 {
  "documentSplitting": {
    "chunkSize": 512,
    "chunkOverlap": 80,
    "separators": ["\n\n", "\n", "。", "！", "？", ";", "；"],
    "strategy": "auto",
    "tokenLimit": 0,
    "languages": ["de", "en"],
    "enableParentChild": true,
    "parentChunkSize": 4096,
    "childChunkSize": 384
  }
 }
 ```
 The `strategy`, `tokenLimit`, and `languages` fields use pointer-based
 DTOs server-side: omitting a field from the payload means "no change",
 while sending an empty string / 0 / [] explicitly resets it to the default.
 The preview endpoint accepts the same payload shape under
 `POST /api/v1/chunker/preview` with an additional `text` field.
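How pointer fields separate "omitted" from "explicitly zero" can be shown with the standard `encoding/json` package. The struct and function names below are hypothetical and do not claim to match the server's real DTOs; storing the zero value stands in for "reset to default".

```go
package main

import (
	"encoding/json"
	"fmt"
)

// splittingPatch is a hypothetical request DTO. A nil pointer means the
// field was absent from the JSON payload.
type splittingPatch struct {
	Strategy   *string   `json:"strategy"`
	TokenLimit *int      `json:"tokenLimit"`
	Languages  *[]string `json:"languages"`
}

// splittingConfig is a hypothetical stored configuration.
type splittingConfig struct {
	Strategy   string
	TokenLimit int
	Languages  []string
}

// applyPatch overwrites only the fields that are present in raw.
func applyPatch(cfg splittingConfig, raw []byte) (splittingConfig, error) {
	var p splittingPatch
	if err := json.Unmarshal(raw, &p); err != nil {
		return cfg, err
	}
	if p.Strategy != nil {
		cfg.Strategy = *p.Strategy // "" here means "reset to default"
	}
	if p.TokenLimit != nil {
		cfg.TokenLimit = *p.TokenLimit // 0 here means "off"
	}
	if p.Languages != nil {
		cfg.Languages = *p.Languages // [] here means "auto-detect"
	}
	return cfg, nil
}

func main() {
	cfg := splittingConfig{Strategy: "heading", TokenLimit: 400, Languages: []string{"de"}}

	// tokenLimit is explicitly reset; strategy and languages are omitted.
	cfg, err := applyPatch(cfg, []byte(`{"tokenLimit": 0}`))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", cfg)
}
```

With plain value fields, an omitted `tokenLimit` and an explicit `0` would both decode to 0, so the server could not tell a reset from a no-op.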
 ## Known trade-offs
 - **Tier-1 heading-aware chunking** prepends the section breadcrumb to
  the embedding input, costing ~5% more tokens per chunk in exchange
  for ~30–50% fewer chunks on structured documents (net token savings
  on storage and at query time).
 - **Strategy switches do not auto-reindex** existing documents. After
  changing a KB's strategy, re-upload affected files (or trigger
  re-indexing via the UI) to apply the new chunking.
 - **OCR artifacts in PDFs** (vertical layout text broken character-
  by-character into separate lines) cannot be fixed by any splitter —
  this is a parser-side limitation. The heuristic tier still keeps
  chunks aligned to page boundaries, which mitigates the worst cases.
 - **The `recursive` strategy value** exists in the API for completeness
  but is intentionally hidden from the UI: it is functionally almost
  identical to `legacy`, and a fifth dropdown option would dilute the
  meaningful choice between automatic / Markdown / heuristic / legacy.
--- a/frontend/src/i18n/locales/en-US.ts
+++ b/frontend/src/i18n/locales/en-US.ts
@@ -1873,14 +1873,14 @@ export default {
    },
    chunking: {
      title: 'Chunking Settings',
-      description: 'Configure document chunking parameters to improve retrieval quality',
+      description: 'Controls how uploaded documents are split before embedding. Defaults work for most cases — tune only when retrieval quality is off.',
      sizeLabel: 'Chunk Size',
-      sizeDescription: 'Controls the number of characters in each chunk (100-4000)',
+      sizeDescription: 'Maximum characters per chunk (100–4000). Default 512 ≈ 100–130 English tokens. Smaller for FAQs (200–400), larger for narrative documents (1000–2000).',
      characters: 'characters',
      overlapLabel: 'Chunk Overlap',
-      overlapDescription: 'Number of overlapping characters between adjacent chunks (0-500)',
+      overlapDescription: 'Characters shared between adjacent chunks (0–500). Default 80 ≈ 15% of size — sweet spot per current research. Use 0 for FAQs/structured data, 150–200 for long-form narratives.',
      separatorsLabel: 'Separators',
-      separatorsDescription: 'Separators used when chunking documents',
+      separatorsDescription: 'Characters or strings the splitter prefers when cutting. Higher-priority separators are tried first; the default order favors paragraph → sentence → punctuation breaks.',
      separatorsPlaceholder: 'Select or customize separators',
      separators: {
        doubleNewline: 'Double newline (\\n\\n)',
@@ -1896,11 +1896,11 @@ export default {
        space: 'Space ( )'
      },
      parentChildLabel: 'Parent-Child Chunking',
-      parentChildDescription: 'Enable two-level parent-child chunking strategy. Large parent chunks provide context while small child chunks are used for vector matching.',
+      parentChildDescription: 'Two-level chunking: small child chunks are vector-matched (precise hits) but the larger parent chunk is returned to the LLM (richer context). Recommended for long documents (>10 pages); skip for short FAQs to save storage.',
      parentChunkSizeLabel: 'Parent Chunk Size',
-      parentChunkSizeDescription: 'Size of parent chunks that provide context (256-4096)',
+      parentChunkSizeDescription: 'Size of the context chunk returned to the LLM (512–8192). Default 4096 ≈ 1000 English tokens, fits comfortably in any modern LLM context window.',
      childChunkSizeLabel: 'Child Chunk Size',
-      childChunkSizeDescription: 'Size of child chunks used for embedding matching (64-1024)',
+      childChunkSizeDescription: 'Size of the embedded chunk used for vector match (64–2048). Default 384 ≈ 80 tokens — sweet spot for sentence-transformer / BGE-style embedders.',
      strategyLabel: 'Chunking Strategy',
      strategyDescription: 'Choose how documents are split into chunks. The Automatic mode profiles each document and picks the best strategy.',
      strategyPlaceholder: 'Select strategy (defaults to classic recursive splitting)',
@@ -1925,9 +1925,9 @@ export default {
      overlapWarning: 'Overlap is large compared to chunk size — chunks will share most of their content.',
      advancedLabel: 'Advanced options',
      tokenLimitLabel: 'Token limit per chunk',
-      tokenLimitDescription: 'Cap chunk size in approximate tokens. When set, chunk size is automatically reduced to stay below this token budget. 0 = off (use character size only).',
+      tokenLimitDescription: 'Hard token cap per chunk (0–8192). 0 = off (chunk size in characters only). Activate when your embedding model has a small token limit: 200 for MiniLM (256 tok), 400 for BGE/Cohere (512 tok). Modern embedders (OpenAI, Voyage, Jina-v3) accept >2000 tokens — leave at 0.',
      languagesLabel: 'Language hints',
-      languagesDescription: 'Hint which languages structure-detection patterns should look for. Leave empty for auto-detection.',
+      languagesDescription: 'Restricts heuristic patterns to the chosen languages (DE/EN/ZH). Empty = auto-detect from sample. Set explicitly for homogeneous corpora to avoid false-positive matches across languages.',
      languagesPlaceholder: 'Auto-detect',
      languageOptions: {
        de: 'German',
--- a/frontend/src/i18n/locales/ko-KR.ts
+++ b/frontend/src/i18n/locales/ko-KR.ts
@@ -2453,14 +2453,14 @@ export default {
    },
    chunking: {
      title: "청크 설정",
-      description: "문서 청킹 파라미터를 설정하여 검색 효과 최적화",
+      description: "업로드된 문서가 임베딩되기 전에 분할되는 방식을 제어합니다. 대부분의 경우 기본값으로 충분합니다.",
      sizeLabel: "청크 크기",
-      sizeDescription: "각 문서 청크의 문자 수 제어 (100-4000)",
+      sizeDescription: "청크당 최대 문자 수 (100-4000). 기본값 512 ≈ 영어 100-130 토큰. FAQ는 200-400, 서술형 문서는 1000-2000.",
      characters: "문자",
      overlapLabel: "청크 중복",
-      overlapDescription: "인접 문서 청크 간의 중복 문자 수 (0-500)",
+      overlapDescription: "인접 청크 간 공유 문자 수 (0-500). 기본값 80 ≈ 청크 크기의 15% — 현재 연구 권장값. FAQ/구조화 데이터는 0, 긴 서술은 150-200.",
      separatorsLabel: "구분자",
-      separatorsDescription: "문서 청킹 시 사용되는 구분자",
+      separatorsDescription: "분할 시 우선적으로 사용되는 문자/문자열. 우선순위가 높은 구분자를 먼저 시도; 기본 순서는 단락 → 문장 → 구두점.",
      separatorsPlaceholder: "구분자 선택 또는 사용자 정의",
      separators: {
        doubleNewline: "이중 줄바꿈 (\\n\\n)",
@@ -2473,11 +2473,11 @@ export default {
        space: "공백 ( )",
      },
      parentChildLabel: "부모-자식 청킹",
-      parentChildDescription: "2단계 부모-자식 청킹 전략을 활성화합니다. 큰 부모 청크는 컨텍스트를 제공하고, 작은 자식 청크는 벡터 매칭에 사용됩니다.",
+      parentChildDescription: "2단계 청킹: 작은 자식 청크는 벡터 매칭(정확한 히트), 큰 부모 청크는 LLM에 반환(풍부한 컨텍스트). 긴 문서(>10페이지)에 권장; 짧은 FAQ는 비활성화하여 저장 공간 절약.",
      parentChunkSizeLabel: "부모 청크 크기",
-      parentChunkSizeDescription: "컨텍스트를 제공하는 부모 청크의 문자 수 (256-4096)",
+      parentChunkSizeDescription: "LLM에 반환되는 컨텍스트 청크 크기 (512-8192). 기본값 4096 ≈ 1000 영어 토큰, 모든 현대 LLM 컨텍스트에 적합.",
      childChunkSizeLabel: "자식 청크 크기",
-      childChunkSizeDescription: "임베딩 매칭에 사용되는 자식 청크의 문자 수 (64-1024)",
+      childChunkSizeDescription: "벡터 매칭에 사용되는 임베딩 청크 크기 (64-2048). 기본값 384 ≈ 80 토큰 — sentence-transformer / BGE 임베더의 최적점.",
      strategyLabel: "청킹 전략",
      strategyDescription: "문서를 청크로 분할하는 방법을 선택합니다. 자동 모드는 문서를 프로파일링하여 최적의 전략을 선택합니다.",
      strategyPlaceholder: "전략 선택 (기본: 클래식 재귀 분할)",
@@ -2502,9 +2502,9 @@ export default {
      overlapWarning: "오버랩이 청크 크기에 비해 큽니다 — 청크가 대부분의 콘텐츠를 공유합니다.",
      advancedLabel: "고급 옵션",
      tokenLimitLabel: "청크당 토큰 제한",
-      tokenLimitDescription: "근사 토큰 수로 청크 크기를 제한합니다. 설정되면 토큰 예산을 유지하기 위해 청크 크기가 자동으로 축소됩니다. 0 = 끄기 (문자 크기만 사용).",
+      tokenLimitDescription: "청크당 토큰 하드 제한 (0-8192). 0 = 끄기 (문자 크기만). 임베딩 모델의 토큰 제한이 작을 때 활성화: MiniLM (256 tok)은 200, BGE/Cohere (512 tok)는 400. 현대 임베더(OpenAI, Voyage, Jina-v3)는 >2000 토큰 지원 — 0으로 두세요.",
      languagesLabel: "언어 힌트",
-      languagesDescription: "구조 감지 패턴이 찾아야 할 언어를 힌트로 제공합니다. 자동 감지를 위해 비워두세요.",
+      languagesDescription: "휴리스틱 패턴을 선택한 언어(DE/EN/ZH)로만 제한합니다. 비어 있음 = 샘플에서 자동 감지. 동질적인 코퍼스는 명시적으로 설정하여 언어 간 오탐 방지.",
      languagesPlaceholder: "자동 감지",
      languageOptions: {
        de: "독일어",
--- a/frontend/src/i18n/locales/ru-RU.ts
+++ b/frontend/src/i18n/locales/ru-RU.ts
@@ -2088,14 +2088,14 @@ export default {
    },
    chunking: {
      title: 'Настройки разбиения',
-      description: 'Настройте параметры разбиения документов для улучшения качества поиска',
+      description: 'Управляет тем, как загруженные документы разбиваются перед эмбеддингом. Значения по умолчанию подходят для большинства случаев — настройте только если качество поиска плохое.',
      sizeLabel: 'Размер блока',
-      sizeDescription: 'Определяет количество символов в каждом блоке (100-4000)',
+      sizeDescription: 'Максимальное количество символов в блоке (100–4000). По умолчанию 512 ≈ 100–130 английских токенов. Меньше для FAQ (200–400), больше для повествовательных документов (1000–2000).',
      characters: 'символов',
      overlapLabel: 'Перекрытие блоков',
-      overlapDescription: 'Количество перекрывающихся символов между соседними блоками (0-500)',
+      overlapDescription: 'Количество символов, общих для соседних блоков (0–500). По умолчанию 80 ≈ 15% размера — оптимум по текущим исследованиям. 0 для FAQ/структурированных данных, 150–200 для длинных повествований.',
      separatorsLabel: 'Разделители',
-      separatorsDescription: 'Разделители, используемые при разбиении документов',
+      separatorsDescription: 'Символы или строки, которые сплиттер предпочитает при резке. Разделители более высокого приоритета пробуются первыми; порядок по умолчанию: абзацы → предложения → пунктуация.',
      separatorsPlaceholder: 'Выберите или настройте разделители',
      separators: {
        doubleNewline: 'Двойной перевод строки (\\n\\n)',
@@ -2108,11 +2108,11 @@ export default {
        space: 'Пробел ( )'
      },
      parentChildLabel: 'Родительско-дочернее разбиение',
-      parentChildDescription: 'Включить двухуровневую стратегию разбиения. Большие родительские блоки обеспечивают контекст, а маленькие дочерние блоки используются для векторного поиска.',
+      parentChildDescription: 'Двухуровневое разбиение: маленькие дочерние блоки используются для векторного матчинга (точные совпадения), большой родительский блок возвращается LLM (более богатый контекст). Рекомендуется для длинных документов (>10 страниц); пропустите для коротких FAQ для экономии хранилища.',
      parentChunkSizeLabel: 'Размер родительского блока',
-      parentChunkSizeDescription: 'Размер родительских блоков для контекста (256-4096)',
+      parentChunkSizeDescription: 'Размер контекстного блока, возвращаемого LLM (512–8192). По умолчанию 4096 ≈ 1000 английских токенов, комфортно вписывается в любое современное контекстное окно.',
      childChunkSizeLabel: 'Размер дочернего блока',
-      childChunkSizeDescription: 'Размер дочерних блоков для поиска по эмбеддингам (64-1024)',
+      childChunkSizeDescription: 'Размер встроенного блока для векторного матчинга (64–2048). По умолчанию 384 ≈ 80 токенов — оптимум для эмбеддеров уровня sentence-transformer / BGE.',
      strategyLabel: 'Стратегия разбиения',
      strategyDescription: 'Выберите способ разбиения документов на блоки. Автоматический режим анализирует каждый документ и выбирает оптимальную стратегию.',
      strategyPlaceholder: 'Выберите стратегию (по умолчанию классическое рекурсивное разбиение)',
@@ -2137,9 +2137,9 @@ export default {
      overlapWarning: 'Перекрытие велико по сравнению с размером блока — блоки будут содержать большую часть одинакового контента.',
      advancedLabel: 'Расширенные параметры',
      tokenLimitLabel: 'Лимит токенов на блок',
-      tokenLimitDescription: 'Ограничьте размер блока приблизительным количеством токенов. При установке размер блока автоматически уменьшается, чтобы оставаться в пределах лимита. 0 = выключено (только символьный размер).',
+      tokenLimitDescription: 'Жёсткий лимит токенов на блок (0–8192). 0 = выкл (только символы). Активируйте, когда у вашего эмбеддера небольшой токен-лимит: 200 для MiniLM (256 ток), 400 для BGE/Cohere (512 ток). Современные эмбеддеры (OpenAI, Voyage, Jina-v3) поддерживают >2000 токенов — оставьте 0.',
      languagesLabel: 'Языковые подсказки',
-      languagesDescription: 'Подсказка о том, какие языки должны искать паттерны определения структуры. Оставьте пустым для автоопределения.',
+      languagesDescription: 'Ограничивает эвристические паттерны выбранными языками (DE/EN/ZH). Пусто = автоопределение из образца. Установите явно для однородных корпусов, чтобы избежать ложных срабатываний между языками.',
      languagesPlaceholder: 'Автоопределение',
      languageOptions: {
        de: 'Немецкий',
--- a/frontend/src/i18n/locales/zh-CN.ts
+++ b/frontend/src/i18n/locales/zh-CN.ts
@@ -2412,14 +2412,14 @@ export default {
    },
    chunking: {
      title: "分块设置",
-      description: "配置文档分块参数，优化检索效果",
+      description: "控制上传文档在嵌入前的切分方式。默认值适用于大多数场景，仅在检索质量异常时调整。",
      sizeLabel: "分块大小",
-      sizeDescription: "控制每个文档分块的字符数（100-4000）",
+      sizeDescription: "每个分块的最大字符数（100-4000）。默认 512 ≈ 中文 300 tokens / 英文 100-130 tokens。FAQ 用 200-400，叙述性长文档用 1000-2000。",
      characters: "字符",
      overlapLabel: "分块重叠",
-      overlapDescription: "相邻文档块之间的重叠字符数（0-500）",
+      overlapDescription: "相邻分块之间共享的字符数（0-500）。默认 80 ≈ 分块大小的 15%，符合当前研究推荐。FAQ/结构化数据用 0，长篇叙述用 150-200。",
      separatorsLabel: "分隔符",
-      separatorsDescription: "文档分块时使用的分隔符",
+      separatorsDescription: "切分时优先使用的字符或字符串。优先级高的分隔符先尝试；默认顺序优先段落 → 句子 → 标点。",
      separatorsPlaceholder: "选择或自定义分隔符",
      separators: {
        doubleNewline: "双换行 (\\n\\n)",
@@ -2432,11 +2432,11 @@ export default {
        space: "空格 ( )",
      },
      parentChildLabel: "父子分块",
-      parentChildDescription: "启用两级父子分块策略。大的父块提供上下文，小的子块用于向量匹配检索。",
+      parentChildDescription: "两级分块：小的子块用于向量匹配（精准命中），大的父块返回给 LLM（更丰富上下文）。建议用于长文档（>10 页）；短 FAQ 可关闭以节省存储。",
      parentChunkSizeLabel: "父块大小",
-      parentChunkSizeDescription: "提供上下文的父块字符数（256-4096）",
+      parentChunkSizeDescription: "返回给 LLM 的上下文块大小（512-8192）。默认 4096 ≈ 1000 英文 tokens，适合所有现代 LLM 上下文窗口。",
      childChunkSizeLabel: "子块大小",
-      childChunkSizeDescription: "用于向量匹配的子块字符数（64-1024）",
+      childChunkSizeDescription: "用于向量匹配的嵌入块大小（64-2048）。默认 384 ≈ 80 tokens，是 sentence-transformer / BGE 类嵌入模型的最佳点。",
      strategyLabel: "分块策略",
      strategyDescription: "选择文档的分块方式。自动模式会分析每个文档的结构并选择最佳策略。",
      strategyPlaceholder: "选择策略（默认使用经典递归分块）",
@@ -2461,9 +2461,9 @@ export default {
      overlapWarning: "重叠相对于分块大小较大——分块之间会共享大部分内容。",
      advancedLabel: "高级选项",
      tokenLimitLabel: "每块 Token 上限",
-      tokenLimitDescription: "按近似 Token 数限制分块大小。设置后会自动缩小分块以保持在 Token 预算内。0 = 关闭（仅按字符数）。",
+      tokenLimitDescription: "每个分块的硬性 Token 上限（0-8192）。0 = 关闭（仅按字符数）。当嵌入模型 Token 上限较小时启用：MiniLM (256 tok) 用 200，BGE/Cohere (512 tok) 用 400。现代嵌入器（OpenAI、Voyage、Jina-v3）支持 >2000 tokens，保持 0 即可。",
      languagesLabel: "语言提示",
-      languagesDescription: "提示结构检测模式应识别哪些语言。留空则自动检测。",
+      languagesDescription: "限制启发式模式只识别选定的语言（DE/EN/ZH）。留空 = 自动检测。同质化语料库可显式设置以避免跨语言误匹配。",
      languagesPlaceholder: "自动检测",
      languageOptions: {
        de: "德语",
--- a/frontend/src/views/knowledge/KnowledgeBaseEditorModal.vue
+++ b/frontend/src/views/knowledge/KnowledgeBaseEditorModal.vue
@@ -409,10 +409,13 @@ const WIKI_ONLY_CHUNKING_PRESET = {
  enableParentChild: false,
 } as const
-// 非 Wiki-only 场景下回落到的默认值（与 initFormData 保持一致）。
+// Non-Wiki-only fallback. Mirrors chunker.DefaultChunkSize and
 // DefaultChunkOverlap on the backend so a freshly created KB uses
 // the same numbers whether the editor sets them or the splitter
 // falls back to its package defaults.
 const DEFAULT_CHUNKING_PRESET = {
  chunkSize: 512,
-  chunkOverlap: 100,
+  chunkOverlap: 80,
  enableParentChild: true,
 } as const
@@ -485,7 +488,9 @@ const initFormData = (type: 'document' | 'faq' = 'document') => {
    },
    chunkingConfig: {
      chunkSize: 512,
-      chunkOverlap: 100,
+      // 80 ≈ 15% of chunkSize — community-recommended sweet spot.
      // Aligned with chunker.DefaultChunkOverlap on the backend.
      chunkOverlap: 80,
      separators: ['\n\n', '\n', '。', '！', '？', ';', '；'],
      parserEngineRules: undefined as any,
      enableParentChild: true,
@@ -586,7 +591,9 @@ const loadKBData = async () => {
      },
      chunkingConfig: {
        chunkSize: kb.chunking_config?.chunk_size || 512,
-        chunkOverlap: kb.chunking_config?.chunk_overlap || 100,
+        // Fallback when the loaded KB has no chunk_overlap stored (`||` also maps a stored 0 to 80).
        // Aligned with chunker.DefaultChunkOverlap on the backend.
        chunkOverlap: kb.chunking_config?.chunk_overlap || 80,
        separators: kb.chunking_config?.separators || ['\n\n', '\n', '。', '！', '？', ';', '；'],
        parserEngineRules: kb.chunking_config?.parser_engine_rules || undefined,
        enableParentChild: kb.chunking_config?.enable_parent_child || false,
--- a/frontend/src/views/knowledge/settings/KBChunkingSettings.vue
+++ b/frontend/src/views/knowledge/settings/KBChunkingSettings.vue
@@ -217,6 +217,19 @@ interface ParserEngineRule {
  engine: string
 }
 // Slider ranges defined in this file (min/max props on t-slider) mirror
 // the validated bounds in the backend splitter:
 //   ChunkSize:      100–4000  (default 512). Below 100, chunks are too
 //                   fragmented to be useful; 4000 approaches the 7500-char
 //                   absoluteMaxSize that the splitter hard-caps to anyway.
 //   ChunkOverlap:   0–500     (default 80). Backend caps to ChunkSize/2
 //                   when set higher than that.
 //   ParentChunkSize: 512–8192 (default 4096 ≈ 1000 EN tokens).
 //   ChildChunkSize:  64–2048  (default 384 ≈ 80 EN tokens, sweet spot for
 //                   sentence-transformer / BGE embedders).
 //   TokenLimit:      0–8192   (default 0 = off, char-based budget only).
 //                   Set to 200 for MiniLM (256-tok limit), 400 for BGE/
 //                   Cohere (512-tok), leave at 0 for OpenAI/Voyage/Jina-v3.
 interface ChunkingConfig {
  chunkSize: number
  chunkOverlap: number
@@ -225,11 +238,11 @@ interface ChunkingConfig {
  enableParentChild: boolean
  parentChunkSize: number
  childChunkSize: number
-  // New: adaptive chunking strategy. Empty string = legacy / not set.
+  // Adaptive chunking strategy. Empty string = legacy / not set.
  strategy?: string
-  // New: cap chunk size in approx tokens. 0 = char-based budget only.
+  // Cap chunk size in approx tokens. 0 = char-based budget only.
  tokenLimit?: number
-  // New: language hints for heuristic patterns (de/en/zh).
+  // Language hints for heuristic patterns (de/en/zh).
  languages?: string[]
 }
--- a/internal/infrastructure/chunker/splitter.go
+++ b/internal/infrastructure/chunker/splitter.go
@@ -66,19 +66,33 @@ type SplitterConfig struct {
 	Languages []string
 }
-// Default sizes used by all entry points (DefaultConfig, ensureDefaults,
+// Default chunk sizing constants. Single source of truth for the entire
-// and buildSplitterConfig in the knowledge service).
+// chunker package and (via knowledge.go::buildSplitterConfig) the
 // knowledge service. The frontend KnowledgeBaseEditorModal mirrors these
 // numbers in its initial form state — keep them in sync if you change
 // either value here.
 //
 // DefaultChunkSize = 512 chars: ~100–130 English tokens / ~300 Chinese
 // tokens. Validated as a strong baseline by the Vecta Feb-2026 benchmark
 // across 50 academic papers. Use 200–400 for FAQ-style atomic content,
 // 1000–2000 for narrative / argumentative documents.
 //
 // DefaultChunkOverlap = 80 chars (≈15% of DefaultChunkSize): community-
 // recommended sweet spot between recall (an answer split across a
 // boundary needs overlap to be retrievable) and storage cost. Use 0 for
 // strictly atomic data (FAQ, JSON records), 150–200 for long narratives
 // where reasoning crosses chunks.
 //
 // MIGRATION NOTE: Prior versions had three different overlap defaults
 // (Go DefaultConfig: 64, knowledge.go buildSplitterConfig: 50, Python
-// docreader: 100). This file is now the single source of truth at 80
+// docreader: 100). All consolidated to 80 here.
 // (≈15% of DefaultChunkSize) — a community-recommended sweet spot.
 //
-// Existing knowledge bases that stored ChunkOverlap=0 in the DB will pick
+// Existing knowledge bases that stored ChunkOverlap=0 in the DB pick
-// up 80 on next re-index; their previously-indexed embeddings will not
+// this 80 up on next re-index; their previously-indexed embeddings will
-// match new ones bit-for-bit. Recall stays similar but search ranking
+// not match new ones bit-for-bit. Recall stays similar but search
-// can shift slightly. To freeze the old behavior on a per-KB basis,
+// ranking can shift slightly. To freeze the old behavior on a per-KB
-// explicitly set ChunkingConfig.ChunkOverlap to 64 before re-indexing.
+// basis, explicitly set ChunkingConfig.ChunkOverlap to 64 before
 // re-indexing.
 const (
 	DefaultChunkSize    = 512
 	DefaultChunkOverlap = 80