docs(chunking): align overlap default + sharpen UI/source/repo docs

Three documentation passes around the adaptive chunking work:

UI
- Frontend ChunkOverlap default consolidated to 80 (was 100), matching
  chunker.DefaultChunkOverlap on the backend. Both DEFAULT_CHUNKING_PRESET
  and initFormData updated. The KB-load fallback also uses 80 when a
  loaded KB has no chunk_overlap stored.
- All four locales (en-US, zh-CN, ko-KR, ru-RU) get rewritten chunking
  setting descriptions: each now states the validated range, the default,
  and the situations where you'd deviate (FAQ vs narrative, embedder
  token limits, language-specific corpora).

Source code
- splitter.go: DefaultChunkSize / DefaultChunkOverlap constants get a
  longer block-comment explaining the per-language token math and the
  use-case sweet spots, plus the migration note on what the old
  inconsistent defaults were.
- KBChunkingSettings.vue: new comment block above ChunkingConfig
  documents the slider min/max for each setting, why those bounds
  exist, and the recommended TokenLimit values per embedding model.

Repo docs
- New docs/CHUNKING.md: end-to-end guide covering why chunking matters,
  the adaptive 3-tier architecture, per-setting reference with ranges
  and sweet spots, parent-child explanation, the token-limit table per
  embedder (OpenAI / Voyage / Cohere / BGE / MiniLM / Jina), 7 use-case
  presets, the debug panel workflow, the API surface, and known
  trade-offs (recursive strategy hidden from UI, no auto-reindex on
  strategy switch, OCR limitations).
- CHANGELOG.md gets a new [Unreleased] section consolidating all the
  adaptive-chunking work shipped on this branch: 5 features, 8
  improvements, 6 fixes, 1 docs entry. The entry references
  docs/CHUNKING.md for deeper explanation.

https://claude.ai/code/session_01XADhx6mtu2ZYW3DE9Lun6k
Claude
2026-05-03 11:57:03 +00:00
parent 8fd06b1e07
commit 13f57caa3e
9 changed files with 277 additions and 52 deletions


@@ -2,6 +2,37 @@
All notable changes to this project will be documented in this file.
## [Unreleased]
### 🚀 New Features
- **NEW**: Adaptive 3-tier chunking — documents are now profiled before splitting and routed to one of three strategies: heading-aware (Markdown structure), heuristic (form-feeds, multilingual chapter markers DE/EN/ZH, all-caps titles, visual separators), or recursive (the modernized legacy splitter as a fallback). Auto-strategy is the new default for fresh KBs; existing KBs keep their previous behavior until the user opts in. See `docs/CHUNKING.md`.
- **NEW**: KB editor — chunking settings panel surfaces the new strategy selector (Automatic / Markdown-optimized / Smart structure detection / Classic) plus advanced options for token limit per chunk and language hints. Sharper inline help text on every setting explains when defaults apply and when to tune.
- **NEW**: Chunking debug panel — embedded "Test with sample text" panel under the chunking settings. Paste a snippet, hit Run preview, see selected tier, rejected tiers + reasons, document profile, size distribution stats over the full chunk set, and per-chunk cards with breadcrumb + content preview. Read-only, no DB or embedding side effects, 5-second server-side timeout.
- **NEW**: `POST /api/v1/chunker/preview` endpoint backing the debug panel. Returns `selected_tier`, `tier_chain`, `rejected[]`, `profile`, `chunks[]`, and `stats`. Capped at 64k input runes / 500 chunks per response.
- **NEW**: Per-tenant RRF (Reciprocal Rank Fusion) tuning — `RRFK`, `RRFVectorWeight`, `RRFKeywordWeight` are now configurable on the tenant `RetrievalConfig`. Defaults preserve the previous hardcoded behavior (k=60, weights 0.7/0.3).
### ⚡ Improvements
- **IMPROVED**: Chunker recursive priority — `splitBySeparators` now genuinely walks separators by priority and recursively re-splits oversize sub-pieces with the next-priority separator. Mirrors the Python reference. Without this fix, a "one paragraph break followed by a long run of newline-separated lines" pattern could emit ~1900-rune chunks at chunkSize=300.
- **IMPROVED**: ChunkOverlap default consolidated to 80 (~15% of ChunkSize). Previously the Go DefaultConfig used 64, the knowledge service used 50, the Python docreader used 100, and the frontend form initialised to 100. All paths now align.
- **IMPROVED**: ContextHeader (Markdown breadcrumb) lives on `Chunk.ContextHeader`, separate from `Chunk.Content`. Restores the `End-Start == len(Content)` invariant that the document-reconstruction path in `knowledge.go` relies on for summary generation and UI highlighting. Eliminates a duplicate-heading regression where the section heading appeared twice in a chunk's body.
- **IMPROVED**: Embedding pipeline — exponential backoff (200/400/800/1600/3200 ms) replaces the previous fixed 100ms × 5 retry loop, with context-cancellation between attempts. `sanitizeForEmbedding` caps single embedding inputs at 20k runes with a warning log on overflow.
- **IMPROVED**: SplitParentChild forces children onto the recursive tier, skipping per-parent profile passes (previously paid N extra O(N) document scans).
- **IMPROVED**: Heuristic splitter snaps overlap start to the nearest semantic boundary or newline instead of slicing mid-line / mid-word.
- **IMPROVED**: Validator flow — when every tier is rejected, the chain returns the legacy tier's output directly instead of running SplitText a second time.
- **IMPROVED**: Token limit per chunk — when set, ChunkSize is auto-clamped to a per-language character budget (with a 10% safety factor). Prevents overshooting embedding model token caps on CJK content where 1 char ≈ 0.6 tokens.
- **IMPROVED**: KB-config API — `strategy`, `tokenLimit`, `languages` use pointer DTOs server-side so a payload omitting a field means "no change" while an explicit empty / 0 / [] resets to default. Previously these were write-once fields.
### 🐛 Bug Fixes
- **FIXED**: Chunker — `Chunk.Start` / `End` rune-offset invariant restored after the heading-aware splitter started prepending breadcrumbs to content (regression introduced during the initial Tier-1 work, fixed before any release).
- **FIXED**: Heuristic splitter — `applyOverlap` aligns to boundaries instead of doing blind char-subtraction that could leave chunks starting mid-word in CJK text.
- **FIXED**: Preview endpoint — chunk-size statistics are now computed over the FULL chunk set before truncating the response payload to 500 entries. Previously `avg`/`min`/`max`/`stddev` reflected only the first 500 chunks of a larger split.
- **FIXED**: Preview endpoint — empty / whitespace-only sample text now returns a friendly 400 ("paste a sample…") instead of gin's cryptic `Field validation failed` error.
- **FIXED**: Frontend chunking debug panel — added explicit `type="button"`, prominent loading and error states, and console error logging so failed previews are debuggable from DevTools without enabling verbose logging. Earlier the panel could appear to "vanish" silently when a request failed.
- **FIXED**: KB-editor i18n — `chunkOverlap` initial form value aligned with the backend default (80, not 100); description texts on every chunking setting now state the recommended ranges per use-case.
### 📚 Documentation
- **DOC**: New `docs/CHUNKING.md` — strategy explanations, settings reference with use-case presets, token-limit guide per embedding model, debugging workflow, and known trade-offs.
## [0.5.1] - 2026-04-30
### 🚀 New Features

docs/CHUNKING.md (new file, 160 lines)

@@ -0,0 +1,160 @@
# Chunking Guide
How WeKnora splits uploaded documents before embedding, why the defaults
are what they are, and when to change them.
## Why chunking matters
Retrieval-Augmented Generation (RAG) works by embedding small slices of
your documents into a vector index, then pulling the most relevant
slices back at query time. The way a document is sliced — chunk size,
overlap, where the cuts fall — directly drives retrieval recall and the
quality of the answers your LLM produces.
Empirically (Vecta Feb-2026 benchmark across 50 academic papers):
recursive splitting at ~512 tokens with ~15% overlap is the strongest
single-knob baseline at 69% end-to-end accuracy, beating semantic
chunking and over-engineered hybrids. WeKnora uses that as the
foundation and layers smarter strategies on top when the document gives
us structural cues.
## Adaptive 3-tier chunking
Set per knowledge base via the editor's **Chunking** sidebar (or the
`strategy` field on the KB-config API).
| Strategy | When picked | What it does |
|----------|-------------|--------------|
| `auto` (recommended) | Default for new KBs | Profiles the document and picks the strongest tier from the chain below. |
| `heading` | Markdown-style structure | Splits at `#` / `##` / `###` boundaries. Each chunk gets a breadcrumb context header (`# Top > ## Section`) prepended at embedding time. |
| `heuristic` | PDF-style structure | Splits at form-feeds (page breaks), numbered sections, multilingual chapter markers (DE / EN / ZH), all-caps titles, and visual separators. |
| `legacy` (= `recursive`) | Anything else, or as fallback | Pure recursive separator-based splitter — newest version with priority recursion and overlap-cap fixes. |
A document profiler runs first and counts structural signals (Markdown
headings, form-feeds, chapter markers per language, all-caps lines,
visual separators, blank-line bursts). Auto-strategy picks the tier
chain based on those counts; a validator rejects obviously broken
output (e.g. the heading splitter producing 200 single-line chunks)
and falls through to the next tier.
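
The flow described above (profile, build a tier chain, validate, fall through) can be sketched roughly as follows. This is a minimal illustration, not the actual Go implementation: every name (`profileDocument`, `buildTierChain`, `validate`, `chunkAuto`), the signal thresholds, and the "tiny chunk" rule are assumptions made for the example.

```typescript
// Hypothetical sketch of profile -> tier chain -> validator fall-through.
// Names and thresholds are illustrative assumptions, not WeKnora's real code.
type Tier = "heading" | "heuristic" | "legacy";
type Splitter = (text: string) => string[];

interface Profile {
  markdownHeadings: number;
  formFeeds: number;
  chapterMarkers: number;
}

interface Rejection {
  tier: Tier;
  reason: string;
}

// Count a few structural signals in the raw text.
function profileDocument(text: string): Profile {
  const lines = text.split("\n");
  return {
    markdownHeadings: lines.filter((l) => /^#{1,3} /.test(l)).length,
    formFeeds: text.split("\f").length - 1,
    chapterMarkers: lines.filter((l) => /^(Chapter|Kapitel) \d+/.test(l)).length,
  };
}

// Order candidate tiers by the signals found; legacy always closes the chain.
function buildTierChain(p: Profile): Tier[] {
  const chain: Tier[] = [];
  if (p.markdownHeadings >= 2) chain.push("heading");
  if (p.formFeeds + p.chapterMarkers >= 2) chain.push("heuristic");
  chain.push("legacy");
  return chain;
}

// Reject obviously broken output, e.g. a split that is mostly tiny chunks.
function validate(chunks: string[]): string | null {
  if (chunks.length === 0) return "no chunks";
  const tiny = chunks.filter((c) => c.trim().length < 20).length;
  return tiny / chunks.length > 0.5 ? "too many tiny chunks" : null;
}

function chunkAuto(
  text: string,
  splitters: Record<Tier, Splitter>,
): { tier: Tier; chunks: string[]; rejected: Rejection[] } {
  const rejected: Rejection[] = [];
  let last: string[] = [];
  for (const tier of buildTierChain(profileDocument(text))) {
    last = splitters[tier](text);
    const reason = validate(last);
    if (reason === null) return { tier, chunks: last, rejected };
    rejected.push({ tier, reason });
  }
  // Every tier was rejected: reuse the legacy output rather than re-running it.
  return { tier: "legacy", chunks: last, rejected };
}
```

The `rejected` list plays the same role as the "rejected tiers + reasons" output of the debug panel: it records why a stronger tier was skipped.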
## Settings reference
### Core
| Setting | Range | Default | Sweet spot for… |
|---------|-------|---------|-----------------|
| **Chunk size** | 100-4000 chars | 512 | Default works for most cases. 200-400 for FAQs / atomic Q&A. 1000-2000 for narrative documents. |
| **Chunk overlap** | 0-500 chars | 80 (~15%) | 0 for FAQs and structured records. 80 (default) for general documents. 150-200 for argumentative texts where reasoning crosses chunks. |
| **Separators** | string list | `["\n\n", "\n", "。", "!", "?", ";", ";"]` | Order matters: the splitter tries higher-priority separators first and only falls back to lower ones when a piece is still oversize. |
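
To make the separator-priority rule concrete, here is a minimal sketch of a priority-recursive splitter. It is an illustration under stated assumptions, not the project's `splitBySeparators`: the function name `splitByPriority` is hypothetical, overlap is left out, and sizes are measured with JavaScript string length (UTF-16 code units) rather than runes.

```typescript
// Hypothetical sketch: split on the highest-priority separator, then re-split
// any piece that is still oversize with the next separator in the list.
// Overlap handling is intentionally omitted.
function splitByPriority(text: string, separators: string[], chunkSize: number): string[] {
  if (text.length <= chunkSize) return text.length > 0 ? [text] : [];
  if (separators.length === 0) {
    // No separators left: hard-cut as a last resort.
    const cut: string[] = [];
    for (let i = 0; i < text.length; i += chunkSize) cut.push(text.slice(i, i + chunkSize));
    return cut;
  }
  const [sep, ...rest] = separators;
  const pieces = text.split(sep).filter((p) => p.length > 0);
  const out: string[] = [];
  let buf = "";
  for (const piece of pieces) {
    if (piece.length > chunkSize) {
      // Oversize piece: flush the buffer, then recurse with lower-priority separators.
      if (buf) {
        out.push(buf);
        buf = "";
      }
      out.push(...splitByPriority(piece, rest, chunkSize));
      continue;
    }
    // Greedily merge small pieces while they still fit in one chunk.
    const candidate = buf ? buf + sep + piece : piece;
    if (candidate.length <= chunkSize) {
      buf = candidate;
    } else {
      out.push(buf);
      buf = piece;
    }
  }
  if (buf) out.push(buf);
  return out;
}
```

With `chunkSize = 10` and separators `["\n\n", "\n", " "]`, the text `"aaaa bbbb\ncccc dddd\n\nee"` is first cut at the paragraph break; only the oversize first paragraph is then re-split at the single newline.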
### Parent-Child chunking
Two-level retrieval: **child chunks** (small, embedded for vector match)
and **parent chunks** (larger, returned to the LLM for context).
| Setting | Range | Default | Notes |
|---------|-------|---------|-------|
| **Enable parent-child** | toggle | on | Recommended for documents > 10 pages. Skip for short FAQs to halve storage cost. |
| **Parent chunk size** | 512-8192 chars | 4096 (~1000 EN tokens) | Larger for long-context LLMs (Claude, GPT-4-Turbo). Smaller (1024-2048) for local LLMs with 4k contexts. |
| **Child chunk size** | 64-2048 chars | 384 (~80 EN tokens) | 128-256 for Q&A-style precise matching. 512-1024 if your embedder accepts 512 tokens or more (E5 / BGE-large). |
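
The two-level idea can be sketched as follows. This is a simplified illustration only: it uses fixed-size windows instead of the real splitter, and `fixedWindows`, `splitParentChild`, and `contextFor` are hypothetical names.

```typescript
// Hypothetical sketch of parent-child chunking with fixed-size windows.
// The real splitter cuts at separators; this only shows the data relationship.
interface ChildChunk {
  parentIndex: number; // which parent this child was cut from
  text: string;        // the small slice that gets embedded
}

function fixedWindows(text: string, size: number): string[] {
  const out: string[] = [];
  for (let i = 0; i < text.length; i += size) out.push(text.slice(i, i + size));
  return out;
}

function splitParentChild(
  text: string,
  parentSize: number,
  childSize: number,
): { parents: string[]; children: ChildChunk[] } {
  const parents = fixedWindows(text, parentSize);
  const children: ChildChunk[] = [];
  parents.forEach((parent, parentIndex) => {
    for (const piece of fixedWindows(parent, childSize)) {
      children.push({ parentIndex, text: piece });
    }
  });
  return { parents, children };
}

// At query time the vector match hits a child; the LLM receives its parent.
function contextFor(hit: ChildChunk, parents: string[]): string {
  return parents[hit.parentIndex] ?? "";
}
```

Only the children are embedded, so the index holds more, smaller vectors, while the parents are stored once and looked up by index when a child matches.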
### Advanced
| Setting | Range | Default | When to set |
|---------|-------|---------|-------------|
| **Token limit** | 0-8192 | 0 (off) | Activate when your embedding model has a small token cap. See table below. |
| **Languages** | `de` / `en` / `zh` (multi-select) | empty (auto-detect) | Set explicitly for homogeneous corpora to narrow heuristic patterns. |
#### Token-limit guide per embedding model
| Embedder | Token limit | Recommended `tokenLimit` setting |
|----------|-------------|----------------------------------|
| OpenAI `text-embedding-3-small/large` | 8191 | **0 (leave off)** |
| Voyage AI `voyage-3` | 32000 | **0** |
| Jina-embeddings-v3 | 8192 | **0** |
| Cohere `embed-multilingual-v3` | 512 | **400** |
| BGE-base / BGE-large / E5-large | 512 | **400** |
| Sentence-Transformer `all-MiniLM-L6-v2` | 256 | **200** |
Rule of thumb: leave at 0 for any modern embedder that accepts more than 2000 tokens.
For smaller embedders, set it to roughly 80% of the model's hard limit so
chunks always fit, even for CJK content (which uses more tokens per character).
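
As a worked version of this rule of thumb, the sketch below computes a suggested `tokenLimit` and the character budget it implies. The function names and the tokens-per-character table are assumptions for illustration: 0.6 for Chinese is the rough figure quoted in the changelog, 0.25 for English follows from "4096 chars ≈ 1000 tokens", and the 10% safety factor mirrors the changelog entry on ChunkSize clamping. The backend's exact formula may differ.

```typescript
// Hypothetical sketch of the 80% rule and the per-language character budget.
// Ratios are rough estimates, not measured tokenizer output.
const TOKENS_PER_CHAR = { en: 0.25, zh: 0.6 } as const;
type Lang = keyof typeof TOKENS_PER_CHAR;

// 0 means "off": embedders with more than 2000 tokens of headroom need no cap.
function recommendedTokenLimit(modelMaxTokens: number): number {
  if (modelMaxTokens > 2000) return 0;
  return Math.floor(modelMaxTokens * 0.8);
}

// Clamp the character-based chunk size so the chunk stays under the token cap,
// keeping a 10% safety margin.
function clampChunkSize(chunkSize: number, tokenLimit: number, lang: Lang): number {
  if (tokenLimit <= 0) return chunkSize;
  const charBudget = Math.floor((tokenLimit * 0.9) / TOKENS_PER_CHAR[lang]);
  return Math.min(chunkSize, charBudget);
}
```

For a 512-token model this yields 409 and for a 256-token model 204; the table above rounds these down to 400 and 200. With `tokenLimit = 200`, a 512-character chunk of Chinese text is clamped to 300 characters, while English text is left at 512.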
## Use-case presets
| Workload | Strategy | ChunkSize | Overlap | Parent-Child |
|----------|----------|-----------|---------|--------------|
| FAQ / Q&A knowledge base | `auto` (likely picks legacy) | 200-400 | 0 | off |
| Markdown documentation / wikis | `auto` (picks heading) | 512 | 80 | on |
| PDF reports with page breaks | `auto` (picks heuristic) | 800-1200 | 100-150 | on |
| Long-form narrative (books, articles) | `auto` (picks legacy) | 1000-2000 | 150-200 | on |
| Code documentation | `legacy` | 800 | 100 | optional |
| Mixed-language corpus | `auto`, languages = empty | 512 | 80 | on |
| Tabular reports / CSV-derived | `legacy` | 400 | 0 | off |
## Debugging in the UI
The KB editor's **Chunking** sidebar has a "Test with sample text"
collapsible at the bottom:
1. Paste a Markdown / plain-text snippet (max 64k characters).
2. Click **Run preview**.
3. The panel shows:
   - Selected strategy tier as a colored tag
   - Tiers that were rejected and why (e.g. "too many tiny chunks")
   - Document profile (heading counts, form-feeds, chapter markers,
     detected languages)
   - Size statistics over the full chunk set (avg / min / max / stddev)
   - Per-chunk cards with size in chars + approximate tokens, position
     range, the section breadcrumb (when set), and a content preview
This runs read-only against a goroutine-isolated splitter pass (5s
timeout) — no DB writes, no embedding API calls. Use it to compare
configurations against the same sample before triggering a re-upload.
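
The "statistics over the full chunk set" behaviour can be illustrated with a short sketch: compute the size statistics first, then truncate the chunk list for the response. The names `sizeStats`, `buildPreview`, and `MAX_CHUNKS_IN_RESPONSE` are hypothetical, and the use of a population standard deviation is an assumption; the 500-chunk cap comes from the changelog.

```typescript
// Hypothetical sketch: stats are computed over ALL chunks, then the payload
// is truncated to the first 500 entries.
interface SizeStats {
  count: number;
  avg: number;
  min: number;
  max: number;
  stddev: number; // population standard deviation (assumption)
}

function sizeStats(sizes: number[]): SizeStats {
  if (sizes.length === 0) return { count: 0, avg: 0, min: 0, max: 0, stddev: 0 };
  let sum = 0;
  let min = Infinity;
  let max = -Infinity;
  for (const s of sizes) {
    sum += s;
    if (s < min) min = s;
    if (s > max) max = s;
  }
  const avg = sum / sizes.length;
  let squared = 0;
  for (const s of sizes) squared += (s - avg) * (s - avg);
  return { count: sizes.length, avg, min, max, stddev: Math.sqrt(squared / sizes.length) };
}

const MAX_CHUNKS_IN_RESPONSE = 500;

function buildPreview(chunks: string[]): { stats: SizeStats; chunks: string[] } {
  const stats = sizeStats(chunks.map((c) => c.length)); // full set first
  return { stats, chunks: chunks.slice(0, MAX_CHUNKS_IN_RESPONSE) }; // then truncate
}
```

Doing it in the opposite order would hide any outliers beyond the 500th chunk, which is the bug the changelog describes as fixed.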
## API
```http
PUT /api/v1/initialization/config/:kbId
Authorization: Bearer <jwt>
Content-Type: application/json

{
  "documentSplitting": {
    "chunkSize": 512,
    "chunkOverlap": 80,
    "separators": ["\n\n", "\n", "。", "!", "?", ";", ";"],
    "strategy": "auto",
    "tokenLimit": 0,
    "languages": ["de", "en"],
    "enableParentChild": true,
    "parentChunkSize": 4096,
    "childChunkSize": 384
  }
}
```
The `strategy`, `tokenLimit`, and `languages` fields use pointer-based
DTOs server-side: omitting them in the payload means "no change", while
sending an empty string / 0 / [] explicitly resets them to the default.
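
The "omitted vs. explicitly empty" distinction can be modelled in TypeScript with optional fields, where `undefined` plays the role of a nil pointer in the Go DTO. This is a sketch of the semantics only: `AdvancedChunking`, `ADVANCED_DEFAULTS`, and `applyPatch` are hypothetical names, and `"auto"` as the default strategy is an assumption.

```typescript
// Hypothetical sketch of PATCH-like semantics: undefined = "no change",
// an explicit empty value = "reset to default".
interface AdvancedChunking {
  strategy: string;
  tokenLimit: number;
  languages: string[];
}

type AdvancedPatch = Partial<AdvancedChunking>;

// Assumed defaults for illustration only.
const ADVANCED_DEFAULTS: AdvancedChunking = { strategy: "auto", tokenLimit: 0, languages: [] };

function applyPatch(current: AdvancedChunking, patch: AdvancedPatch): AdvancedChunking {
  const next: AdvancedChunking = { ...current, languages: [...current.languages] };
  if (patch.strategy !== undefined) {
    next.strategy = patch.strategy === "" ? ADVANCED_DEFAULTS.strategy : patch.strategy;
  }
  if (patch.tokenLimit !== undefined) {
    next.tokenLimit = patch.tokenLimit; // 0 is itself the default (off)
  }
  if (patch.languages !== undefined) {
    next.languages = [...patch.languages]; // [] is itself the default (auto-detect)
  }
  return next;
}
```

A payload parsed from `{"tokenLimit": 0}` therefore turns the token cap off and leaves `strategy` and `languages` untouched, whereas `{}` changes nothing.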
The preview endpoint accepts the same payload shape under
`POST /api/v1/chunker/preview` with an additional `text` field.
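
A client-side call to the preview endpoint might be prepared like this. It is a sketch under assumptions: the helper name `buildPreviewRequest` is hypothetical, and placing `text` at the top level next to `documentSplitting` is a guess at the payload layout, which the text above does not spell out. The empty-input check mirrors the 400 the server returns for whitespace-only samples.

```typescript
// Hypothetical sketch: build the request for POST /api/v1/chunker/preview.
// The position of "text" in the payload is an assumption.
interface DocumentSplitting {
  chunkSize: number;
  chunkOverlap: number;
  strategy?: string;
  tokenLimit?: number;
  languages?: string[];
}

interface PreviewRequestInit {
  method: string;
  headers: Record<string, string>;
  body: string;
}

function buildPreviewRequest(
  text: string,
  documentSplitting: DocumentSplitting,
  jwt: string,
): { url: string; init: PreviewRequestInit } {
  if (text.trim().length === 0) {
    // The server would answer 400 anyway; fail early on the client.
    throw new Error("paste a sample before running the preview");
  }
  return {
    url: "/api/v1/chunker/preview",
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${jwt}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ text, documentSplitting }),
    },
  };
}
```

The result can be passed straight to `fetch(url, init)`; per the changelog, the response carries `selected_tier`, `tier_chain`, `rejected[]`, `profile`, `chunks[]`, and `stats`.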
## Known trade-offs
- **Tier-1 heading-aware chunking** prepends the section breadcrumb to
the embedding input, costing ~5% more tokens per chunk in exchange
for ~30-50% fewer chunks on structured documents (net token savings
on storage and at query time).
- **Strategy switches do not auto-reindex** existing documents. After
changing a KB's strategy, re-upload affected files (or trigger
re-indexing via the UI) to apply the new chunking.
- **OCR artifacts in PDFs** (vertical layout text broken character-
by-character into separate lines) cannot be fixed by any splitter —
this is a parser-side limitation. The heuristic tier still keeps
chunks aligned to page boundaries, which mitigates the worst cases.
- **The `recursive` strategy value** exists in the API for completeness
but is intentionally hidden from the UI: it is functionally near-identical to
`legacy`, and a fifth dropdown option would dilute the meaningful
choice between automatic / Markdown / heuristic / legacy.


@@ -1873,14 +1873,14 @@ export default {
  },
  chunking: {
  title: 'Chunking Settings',
- description: 'Configure document chunking parameters to improve retrieval quality',
+ description: 'Controls how uploaded documents are split before embedding. Defaults work for most cases — tune only when retrieval quality is off.',
  sizeLabel: 'Chunk Size',
- sizeDescription: 'Controls the number of characters in each chunk (100-4000)',
+ sizeDescription: 'Maximum characters per chunk (100-4000). Default 512 ≈ 100-130 English tokens. Smaller for FAQs (200-400), larger for narrative documents (1000-2000).',
  characters: 'characters',
  overlapLabel: 'Chunk Overlap',
- overlapDescription: 'Number of overlapping characters between adjacent chunks (0-500)',
+ overlapDescription: 'Characters shared between adjacent chunks (0-500). Default 80 ≈ 15% of size — sweet spot per current research. Use 0 for FAQs/structured data, 150-200 for long-form narratives.',
  separatorsLabel: 'Separators',
- separatorsDescription: 'Separators used when chunking documents',
+ separatorsDescription: 'Characters or strings the splitter prefers when cutting. Higher-priority separators are tried first; the default order favors paragraph → sentence → punctuation breaks.',
  separatorsPlaceholder: 'Select or customize separators',
  separators: {
  doubleNewline: 'Double newline (\
@@ -1896,11 +1896,11 @@ export default {
  space: 'Space ( )'
  },
  parentChildLabel: 'Parent-Child Chunking',
- parentChildDescription: 'Enable two-level parent-child chunking strategy. Large parent chunks provide context while small child chunks are used for vector matching.',
+ parentChildDescription: 'Two-level chunking: small child chunks are vector-matched (precise hits) but the larger parent chunk is returned to the LLM (richer context). Recommended for long documents (>10 pages); skip for short FAQs to save storage.',
  parentChunkSizeLabel: 'Parent Chunk Size',
- parentChunkSizeDescription: 'Size of parent chunks that provide context (256-4096)',
+ parentChunkSizeDescription: 'Size of the context chunk returned to the LLM (512-8192). Default 4096 ≈ 1000 English tokens, fits comfortably in any modern LLM context window.',
  childChunkSizeLabel: 'Child Chunk Size',
- childChunkSizeDescription: 'Size of child chunks used for embedding matching (64-1024)',
+ childChunkSizeDescription: 'Size of the embedded chunk used for vector match (64-2048). Default 384 ≈ 80 tokens — sweet spot for sentence-transformer / BGE-style embedders.',
  strategyLabel: 'Chunking Strategy',
  strategyDescription: 'Choose how documents are split into chunks. The Automatic mode profiles each document and picks the best strategy.',
  strategyPlaceholder: 'Select strategy (defaults to classic recursive splitting)',
@@ -1925,9 +1925,9 @@ export default {
  overlapWarning: 'Overlap is large compared to chunk size — chunks will share most of their content.',
  advancedLabel: 'Advanced options',
  tokenLimitLabel: 'Token limit per chunk',
- tokenLimitDescription: 'Cap chunk size in approximate tokens. When set, chunk size is automatically reduced to stay below this token budget. 0 = off (use character size only).',
+ tokenLimitDescription: 'Hard token cap per chunk (0-8192). 0 = off (chunk size in characters only). Activate when your embedding model has a small token limit: 200 for MiniLM (256 tok), 400 for BGE/Cohere (512 tok). Modern embedders (OpenAI, Voyage, Jina-v3) accept >2000 tokens — leave at 0.',
  languagesLabel: 'Language hints',
- languagesDescription: 'Hint which languages structure-detection patterns should look for. Leave empty for auto-detection.',
+ languagesDescription: 'Restricts heuristic patterns to the chosen languages (DE/EN/ZH). Empty = auto-detect from sample. Set explicitly for homogeneous corpora to avoid false-positive matches across languages.',
  languagesPlaceholder: 'Auto-detect',
  languageOptions: {
  de: 'German',


@@ -2453,14 +2453,14 @@ export default {
  },
  chunking: {
  title: "청크 설정",
- description: "문서 청킹 파라미터를 설정하여 검색 효과 최적화",
+ description: "업로드된 문서가 임베딩되기 전에 분할되는 방식을 제어합니다. 대부분의 경우 기본값으로 충분합니다.",
  sizeLabel: "청크 크기",
- sizeDescription: "각 문서 청크의 문자 수 제어 (100-4000)",
+ sizeDescription: "청크당 최대 문자 수 (100-4000). 기본값 512 ≈ 영어 100-130 토큰. FAQ는 200-400, 서술형 문서는 1000-2000.",
  characters: "문자",
  overlapLabel: "청크 중복",
- overlapDescription: "인접 문서 청크 간의 중복 문자 수 (0-500)",
+ overlapDescription: "인접 청크 간 공유 문자 수 (0-500). 기본값 80 ≈ 청크 크기의 15% — 현재 연구 권장값. FAQ/구조화 데이터는 0, 긴 서술은 150-200.",
  separatorsLabel: "구분자",
- separatorsDescription: "문서 청킹 시 사용되는 구분자",
+ separatorsDescription: "분할 시 우선적으로 사용되는 문자/문자열. 우선순위가 높은 구분자를 먼저 시도; 기본 순서는 단락 → 문장 → 구두점.",
  separatorsPlaceholder: "구분자 선택 또는 사용자 정의",
  separators: {
  doubleNewline: "이중 줄바꿈 (\\n\\n)",
@@ -2473,11 +2473,11 @@ export default {
  space: "공백 ( )",
  },
  parentChildLabel: "부모-자식 청킹",
- parentChildDescription: "2단계 부모-자식 청킹 전략을 활성화합니다. 큰 부모 청크는 컨텍스트를 제공하고, 작은 자식 청크는 벡터 매칭에 사용됩니다.",
+ parentChildDescription: "2단계 청킹: 작은 자식 청크는 벡터 매칭(정확한 히트), 큰 부모 청크는 LLM에 반환(풍부한 컨텍스트). 긴 문서(>10페이지)에 권장; 짧은 FAQ는 비활성화하여 저장 공간 절약.",
  parentChunkSizeLabel: "부모 청크 크기",
- parentChunkSizeDescription: "컨텍스트를 제공하는 부모 청크의 문자 수 (256-4096)",
+ parentChunkSizeDescription: "LLM에 반환되는 컨텍스트 청크 크기 (512-8192). 기본값 4096 ≈ 1000 영어 토큰, 모든 현대 LLM 컨텍스트에 적합.",
  childChunkSizeLabel: "자식 청크 크기",
- childChunkSizeDescription: "임베딩 매칭에 사용되는 자식 청크의 문자 수 (64-1024)",
+ childChunkSizeDescription: "벡터 매칭에 사용되는 임베딩 청크 크기 (64-2048). 기본값 384 ≈ 80 토큰 — sentence-transformer / BGE 임베더의 최적점.",
  strategyLabel: "청킹 전략",
  strategyDescription: "문서를 청크로 분할하는 방법을 선택합니다. 자동 모드는 문서를 프로파일링하여 최적의 전략을 선택합니다.",
  strategyPlaceholder: "전략 선택 (기본: 클래식 재귀 분할)",
@@ -2502,9 +2502,9 @@ export default {
  overlapWarning: "오버랩이 청크 크기에 비해 큽니다 — 청크가 대부분의 콘텐츠를 공유합니다.",
  advancedLabel: "고급 옵션",
  tokenLimitLabel: "청크당 토큰 제한",
- tokenLimitDescription: "근사 토큰 수로 청크 크기를 제한합니다. 설정되면 토큰 예산을 유지하기 위해 청크 크기가 자동으로 축소됩니다. 0 = 끄기 (문자 크기만 사용).",
+ tokenLimitDescription: "청크당 토큰 하드 제한 (0-8192). 0 = 끄기 (문자 크기만). 임베딩 모델의 토큰 제한이 작을 때 활성화: MiniLM (256 tok)은 200, BGE/Cohere (512 tok)는 400. 현대 임베더(OpenAI, Voyage, Jina-v3)는 >2000 토큰 지원 — 0으로 두세요.",
  languagesLabel: "언어 힌트",
- languagesDescription: "구조 감지 패턴이 찾아야 할 언어를 힌트로합니다. 자동 감지를 위해 비워두세요.",
+ languagesDescription: "휴리스틱 패턴을 선택한 언어(DE/EN/ZH)로만합니다. 비어 있음 = 샘플에서 자동 감지. 동질적인 코퍼스는 명시적으로 설정하여 언어 간 오탐 방지.",
  languagesPlaceholder: "자동 감지",
  languageOptions: {
  de: "독일어",


@@ -2088,14 +2088,14 @@ export default {
  },
  chunking: {
  title: 'Настройки разбиения',
- description: 'Настройте параметры разбиения документов для улучшения качества поиска',
+ description: 'Управляет тем, как загруженные документы разбиваются перед эмбеддингом. Значения по умолчанию подходят для большинства случаев — настройте только если качество поиска плохое.',
  sizeLabel: 'Размер блока',
- sizeDescription: 'Определяет количество символов в каждом блоке (100-4000)',
+ sizeDescription: 'Максимальное количество символов в блоке (100-4000). По умолчанию 512 ≈ 100-130 английских токенов. Меньше для FAQ (200-400), больше для повествовательных документов (1000-2000).',
  characters: 'символов',
  overlapLabel: 'Перекрытие блоков',
- overlapDescription: 'Количество перекрывающихся символов между соседними блоками (0-500)',
+ overlapDescription: 'Количество символов, общих для соседних блоков (0-500). По умолчанию 80 ≈ 15% размера — оптимум по текущим исследованиям. 0 для FAQ/структурированных данных, 150-200 для длинных повествований.',
  separatorsLabel: 'Разделители',
- separatorsDescription: 'Разделители, используемые при разбиении документов',
+ separatorsDescription: 'Символы или строки, которые сплиттер предпочитает при резке. Разделители более высокого приоритета пробуются первыми; порядок по умолчанию: абзацы → предложения → пунктуация.',
  separatorsPlaceholder: 'Выберите или настройте разделители',
  separators: {
  doubleNewline: 'Двойной перевод строки (\\n\\n)',
@@ -2108,11 +2108,11 @@ export default {
  space: 'Пробел ( )'
  },
  parentChildLabel: 'Родительско-дочернее разбиение',
- parentChildDescription: 'Включить двухуровневую стратегию разбиения. Большие родительские блоки обеспечивают контекст, а маленькие дочерние блоки используются для векторного поиска.',
+ parentChildDescription: 'Двухуровневое разбиение: маленькие дочерние блоки используются для векторного матчинга (точные совпадения), большой родительский блок возвращается LLM (более богатый контекст). Рекомендуется для длинных документов (>10 страниц); пропустите для коротких FAQ для экономии хранилища.',
  parentChunkSizeLabel: 'Размер родительского блока',
- parentChunkSizeDescription: 'Размер родительских блоков для контекста (256-4096)',
+ parentChunkSizeDescription: 'Размер контекстного блока, возвращаемого LLM (512-8192). По умолчанию 4096 ≈ 1000 английских токенов, комфортно вписывается в любое современное контекстное окно.',
  childChunkSizeLabel: 'Размер дочернего блока',
- childChunkSizeDescription: 'Размер дочерних блоков для поиска по эмбеддингам (64-1024)',
+ childChunkSizeDescription: 'Размер встроенного блока для векторного матчинга (64-2048). По умолчанию 384 ≈ 80 токенов — оптимум для эмбеддеров уровня sentence-transformer / BGE.',
  strategyLabel: 'Стратегия разбиения',
  strategyDescription: 'Выберите способ разбиения документов на блоки. Автоматический режим анализирует каждый документ и выбирает оптимальную стратегию.',
  strategyPlaceholder: 'Выберите стратегию (по умолчанию классическое рекурсивное разбиение)',
@@ -2137,9 +2137,9 @@ export default {
  overlapWarning: 'Перекрытие велико по сравнению с размером блока — блоки будут содержать большую часть одинакового контента.',
  advancedLabel: 'Расширенные параметры',
  tokenLimitLabel: 'Лимит токенов на блок',
- tokenLimitDescription: 'Ограничьте размер блока приблизительным количеством токенов. При установке размер блока автоматически уменьшается, чтобы оставаться в пределах лимита. 0 = выключено (только символьный размер).',
+ tokenLimitDescription: 'Жёсткий лимит токенов на блок (0-8192). 0 = выкл (только символы). Активируйте, когда у вашего эмбеддера небольшой токен-лимит: 200 для MiniLM (256 ток), 400 для BGE/Cohere (512 ток). Современные эмбеддеры (OpenAI, Voyage, Jina-v3) поддерживают >2000 токенов — оставьте 0.',
  languagesLabel: 'Языковые подсказки',
- languagesDescription: 'Подсказка о том, какие языки должны искать паттерны определения структуры. Оставьте пустым для автоопределения.',
+ languagesDescription: 'Ограничивает эвристические паттерны выбранными языками (DE/EN/ZH). Пусто = автоопределение из образца. Установите явно для однородных корпусов, чтобы избежать ложных срабатываний между языками.',
  languagesPlaceholder: 'Автоопределение',
  languageOptions: {
  de: 'Немецкий',


@@ -2412,14 +2412,14 @@ export default {
  },
  chunking: {
  title: "分块设置",
- description: "配置文档分块参数,优化检索效果",
+ description: "控制上传文档在嵌入前的切分方式。默认值适用于大多数场景,仅在检索质量异常时调整。",
  sizeLabel: "分块大小",
- sizeDescription: "控制每个文档分块的字符数(100-4000)",
+ sizeDescription: "每个分块的最大字符数(100-4000)。默认 512 ≈ 中文 300 tokens / 英文 100-130 tokens。FAQ 用 200-400,叙述性长文档用 1000-2000。",
  characters: "字符",
  overlapLabel: "分块重叠",
- overlapDescription: "相邻文档块之间的重叠字符数(0-500)",
+ overlapDescription: "相邻块之间共享的字符数(0-500)。默认 80 ≈ 分块大小的 15%,符合当前研究推荐。FAQ/结构化数据用 0,长篇叙述用 150-200。",
  separatorsLabel: "分隔符",
- separatorsDescription: "文档分块时使用的分隔符",
+ separatorsDescription: "切分时优先使用的字符或字符串。优先级高的分隔符先尝试;默认顺序优先段落 → 句子 → 标点。",
  separatorsPlaceholder: "选择或自定义分隔符",
  separators: {
  doubleNewline: "双换行 (\\n\\n)",
@@ -2432,11 +2432,11 @@ export default {
  space: "空格 ( )",
  },
  parentChildLabel: "父子分块",
- parentChildDescription: "启用两级父子分块策略。大的父块提供上下文,小的子块用于向量匹配检索。",
+ parentChildDescription: "两级分块:小的子块用于向量匹配(精准命中),大的父块返回给 LLM(更丰富上下文)。建议用于长文档(>10 页);短 FAQ 可关闭以节省存储。",
  parentChunkSizeLabel: "父块大小",
- parentChunkSizeDescription: "提供上下文的父块字符数(256-4096)",
+ parentChunkSizeDescription: "返回给 LLM 的上下文块大小(512-8192)。默认 4096 ≈ 1000 英文 tokens,适合所有现代 LLM 上下文窗口。",
  childChunkSizeLabel: "子块大小",
- childChunkSizeDescription: "用于向量匹配的子块字符数(64-1024)",
+ childChunkSizeDescription: "用于向量匹配的嵌入块大小(64-2048)。默认 384 ≈ 80 tokens,是 sentence-transformer / BGE 类嵌入模型的最佳点。",
  strategyLabel: "分块策略",
  strategyDescription: "选择文档的分块方式。自动模式会分析每个文档的结构并选择最佳策略。",
  strategyPlaceholder: "选择策略(默认使用经典递归分块)",
@@ -2461,9 +2461,9 @@ export default {
  overlapWarning: "重叠相对于分块大小较大——分块之间会共享大部分内容。",
  advancedLabel: "高级选项",
  tokenLimitLabel: "每块 Token 上限",
- tokenLimitDescription: "按近似 Token 数限制分块大小。设置后会自动缩小分块以保持在 Token 预算内。0 = 关闭(仅按字符数)。",
+ tokenLimitDescription: "每个分块的硬性 Token 上限(0-8192)。0 = 关闭(仅按字符数)。当嵌入模型 Token 上限较小时启用:MiniLM (256 tok) 用 200,BGE/Cohere (512 tok) 用 400。现代嵌入器(OpenAI、Voyage、Jina-v3)支持 >2000 tokens,保持 0 即可。",
  languagesLabel: "语言提示",
- languagesDescription: "提示结构检测模式识别哪些语言。留空则自动检测。",
+ languagesDescription: "限制启发式模式识别选定的语言(DE/EN/ZH)。留空 = 自动检测。同质化语料库可显式设置以避免跨语言误匹配。",
  languagesPlaceholder: "自动检测",
  languageOptions: {
  de: "德语",


@@ -409,10 +409,13 @@ const WIKI_ONLY_CHUNKING_PRESET = {
  enableParentChild: false,
  } as const
- // Wiki-only 场景下回落到的默认值(与 initFormData 保持一致)。
+ // Non-Wiki-only fallback. Mirrors chunker.DefaultChunkSize and
+ // DefaultChunkOverlap on the backend so a freshly created KB uses
+ // the same numbers whether the editor sets them or the splitter
+ // falls back to its package defaults.
  const DEFAULT_CHUNKING_PRESET = {
  chunkSize: 512,
- chunkOverlap: 100,
+ chunkOverlap: 80,
  enableParentChild: true,
  } as const
@@ -485,7 +488,9 @@ const initFormData = (type: 'document' | 'faq' = 'document') => {
  },
  chunkingConfig: {
  chunkSize: 512,
- chunkOverlap: 100,
+ // 80 ≈ 15% of chunkSize — community-recommended sweet spot.
+ // Aligned with chunker.DefaultChunkOverlap on the backend.
+ chunkOverlap: 80,
  separators: ['\n\n', '\n', '。', '!', '?', ';', ';'],
  parserEngineRules: undefined as any,
  enableParentChild: true,
@@ -586,7 +591,9 @@ const loadKBData = async () => {
}, },
chunkingConfig: { chunkingConfig: {
chunkSize: kb.chunking_config?.chunk_size || 512, chunkSize: kb.chunking_config?.chunk_size || 512,
chunkOverlap: kb.chunking_config?.chunk_overlap || 100, // Fallback used when the loaded KB has no chunk_overlap stored (a stored 0
// also falls back, because || treats 0 as unset). Aligned with chunker.DefaultChunkOverlap on the backend.
chunkOverlap: kb.chunking_config?.chunk_overlap || 80,
separators: kb.chunking_config?.separators || ['\n\n', '\n', '。', '', '', ';', ''], separators: kb.chunking_config?.separators || ['\n\n', '\n', '。', '', '', ';', ''],
parserEngineRules: kb.chunking_config?.parser_engine_rules || undefined, parserEngineRules: kb.chunking_config?.parser_engine_rules || undefined,
enableParentChild: kb.chunking_config?.enable_parent_child || false, enableParentChild: kb.chunking_config?.enable_parent_child || false,
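
The `|| 80` fallback in `loadKBData` fires for any falsy value, so a knowledge base that explicitly stores `chunk_overlap: 0` also loads as 80. That may be intentional, since the migration note in splitter.go says a stored 0 picks up the default on re-index. The sketch below only contrasts the two operators; `StoredChunkingConfig`, `overlapWithOr` and `overlapWithNullish` are hypothetical names for illustration, not part of the codebase.

```typescript
// Hypothetical, trimmed-down shape of the stored config (assumption).
interface StoredChunkingConfig {
  chunk_overlap?: number | null
}

const DEFAULT_CHUNK_OVERLAP = 80

// Same pattern as the diff: every falsy value, including 0, becomes 80.
function overlapWithOr(cfg?: StoredChunkingConfig): number {
  return cfg?.chunk_overlap || DEFAULT_CHUNK_OVERLAP
}

// Alternative: only null/undefined fall back, so an explicit 0 is kept.
function overlapWithNullish(cfg?: StoredChunkingConfig): number {
  return cfg?.chunk_overlap ?? DEFAULT_CHUNK_OVERLAP
}

console.log(overlapWithOr({ chunk_overlap: 0 }))      // 80
console.log(overlapWithNullish({ chunk_overlap: 0 })) // 0
```

Which behavior is wanted depends on whether 0 is meant as "unset" or as "no overlap" (the splitter comment recommends 0 for FAQ-style data), so the choice of operator is a design decision rather than a pure bug fix.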


@@ -217,6 +217,19 @@ interface ParserEngineRule {
engine: string engine: string
} }
// Slider ranges defined in this file (min/max props on t-slider) mirror
// the validated bounds in the backend splitter:
// ChunkSize: 100-4000 (default 512). 100 = too fragmented to be
// useful; 4000 = approaches the 7500-char absoluteMaxSize
// that the splitter hard-caps to anyway.
// ChunkOverlap: 0-500 (default 80). Backend caps to ChunkSize/2
// when set higher than that.
// ParentChunkSize: 512-8192 (default 4096 ≈ 1000 EN tokens).
// ChildChunkSize: 64-2048 (default 384 ≈ 80 EN tokens, sweet spot for
// sentence-transformer / BGE embedders).
// TokenLimit: 0-8192 (default 0 = off, char-based budget only).
// Set to 200 for MiniLM (256-tok limit), 400 for BGE/
// Cohere (512-tok), leave at 0 for OpenAI/Voyage/Jina-v3.
interface ChunkingConfig { interface ChunkingConfig {
chunkSize: number chunkSize: number
chunkOverlap: number chunkOverlap: number
@@ -225,11 +238,11 @@ interface ChunkingConfig {
enableParentChild: boolean enableParentChild: boolean
parentChunkSize: number parentChunkSize: number
childChunkSize: number childChunkSize: number
// New: adaptive chunking strategy. Empty string = legacy / not set. // Adaptive chunking strategy. Empty string = legacy / not set.
strategy?: string strategy?: string
// New: cap chunk size in approx tokens. 0 = char-based budget only. // Cap chunk size in approx tokens. 0 = char-based budget only.
tokenLimit?: number tokenLimit?: number
// New: language hints for heuristic patterns (de/en/zh). // Language hints for heuristic patterns (de/en/zh).
languages?: string[] languages?: string[]
} }
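
The bounds listed in the comment block above can be read as a simple clamping rule. The following is a minimal sketch under that reading, not the validation code the backend actually runs: `RANGES`, `ChunkSizes` and `clampChunkingConfig` are hypothetical names, and rounding the ChunkSize/2 cap down to an integer is an assumption.

```typescript
interface Range {
  min: number
  max: number
}

// Bounds copied from the comment block; update them if the sliders change.
const RANGES = {
  chunkSize: { min: 100, max: 4000 },
  chunkOverlap: { min: 0, max: 500 },
  parentChunkSize: { min: 512, max: 8192 },
  childChunkSize: { min: 64, max: 2048 },
  tokenLimit: { min: 0, max: 8192 },
} as const

// Hypothetical subset of ChunkingConfig holding only the numeric settings.
interface ChunkSizes {
  chunkSize: number
  chunkOverlap: number
  parentChunkSize: number
  childChunkSize: number
  tokenLimit: number
}

const clamp = (value: number, range: Range): number =>
  Math.min(range.max, Math.max(range.min, value))

function clampChunkingConfig(cfg: ChunkSizes): ChunkSizes {
  const chunkSize = clamp(cfg.chunkSize, RANGES.chunkSize)
  // Overlap is limited by its own range and by half the chunk size
  // (the floor is an assumption about how the cap is rounded).
  const chunkOverlap = Math.min(
    clamp(cfg.chunkOverlap, RANGES.chunkOverlap),
    Math.floor(chunkSize / 2),
  )
  return {
    chunkSize,
    chunkOverlap,
    parentChunkSize: clamp(cfg.parentChunkSize, RANGES.parentChunkSize),
    childChunkSize: clamp(cfg.childChunkSize, RANGES.childChunkSize),
    tokenLimit: clamp(cfg.tokenLimit, RANGES.tokenLimit),
  }
}
```

With the defaults (512 / 80 / 4096 / 384 / 0) the function returns its input unchanged; an overlap of 400 on a chunk size of 200 comes back as 100.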


@@ -66,19 +66,33 @@ type SplitterConfig struct {
Languages []string Languages []string
} }
// Default sizes used by all entry points (DefaultConfig, ensureDefaults, // Default chunk sizing constants. Single source of truth for the entire
// and buildSplitterConfig in the knowledge service). // chunker package and (via knowledge.go::buildSplitterConfig) the
// knowledge service. The frontend KnowledgeBaseEditorModal mirrors these
// numbers in its initial form state — keep them in sync if you change
// either value here.
//
// DefaultChunkSize = 512 chars: ~100-130 English tokens / ~300 Chinese
// tokens. Validated as a strong baseline by the Vecta Feb-2026 benchmark
// across 50 academic papers. Use 200-400 for FAQ-style atomic content,
// 1000-2000 for narrative / argumentative documents.
//
// DefaultChunkOverlap = 80 chars (≈15% of DefaultChunkSize): community-
// recommended sweet spot between recall (an answer split across a
// boundary needs overlap to be retrievable) and storage cost. Use 0 for
// strictly atomic data (FAQ, JSON records), 150-200 for long narratives
// where reasoning crosses chunks.
// //
// MIGRATION NOTE: Prior versions had three different overlap defaults // MIGRATION NOTE: Prior versions had three different overlap defaults
// (Go DefaultConfig: 64, knowledge.go buildSplitterConfig: 50, Python // (Go DefaultConfig: 64, knowledge.go buildSplitterConfig: 50, Python
// docreader: 100). This file is now the single source of truth at 80 // docreader: 100). All consolidated to 80 here.
// (≈15% of DefaultChunkSize) — a community-recommended sweet spot.
// //
// Existing knowledge bases that stored ChunkOverlap=0 in the DB will pick // Existing knowledge bases that stored ChunkOverlap=0 in the DB pick
// up 80 on next re-index; their previously-indexed embeddings will not // this 80 up on next re-index; their previously-indexed embeddings will
// match new ones bit-for-bit. Recall stays similar but search ranking // not match new ones bit-for-bit. Recall stays similar but search
// can shift slightly. To freeze the old behavior on a per-KB basis, // ranking can shift slightly. To freeze the old behavior on a per-KB
// explicitly set ChunkingConfig.ChunkOverlap to 64 before re-indexing. // basis, explicitly set ChunkingConfig.ChunkOverlap to 64 before
// re-indexing.
const ( const (
DefaultChunkSize = 512 DefaultChunkSize = 512
DefaultChunkOverlap = 80 DefaultChunkOverlap = 80