diff --git a/CHANGELOG.md b/CHANGELOG.md index 93a091ff..b153c922 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -2,6 +2,37 @@ All notable changes to this project will be documented in this file. +## [Unreleased] + +### 🚀 New Features +- **NEW**: Adaptive 3-tier chunking — documents are now profiled before splitting and routed to one of three strategies: heading-aware (Markdown structure), heuristic (form-feeds, multilingual chapter markers DE/EN/ZH, all-caps titles, visual separators), or recursive (the modernized legacy splitter as a fallback). Auto-strategy is the new default for fresh KBs; existing KBs keep their previous behavior until the user opts in. See `docs/CHUNKING.md`. +- **NEW**: KB editor — chunking settings panel surfaces the new strategy selector (Automatic / Markdown-optimized / Smart structure detection / Classic) plus advanced options for token limit per chunk and language hints. Sharper inline help text on every setting explains when defaults apply and when to tune. +- **NEW**: Chunking debug panel — embedded "Test with sample text" panel under the chunking settings. Paste a snippet, hit Run preview, see selected tier, rejected tiers + reasons, document profile, size distribution stats over the full chunk set, and per-chunk cards with breadcrumb + content preview. Read-only, no DB or embedding side effects, 5-second server-side timeout. +- **NEW**: `POST /api/v1/chunker/preview` endpoint backing the debug panel. Returns `selected_tier`, `tier_chain`, `rejected[]`, `profile`, `chunks[]`, and `stats`. Capped at 64k input runes / 500 chunks per response. +- **NEW**: Per-tenant RRF (Reciprocal Rank Fusion) tuning — `RRFK`, `RRFVectorWeight`, `RRFKeywordWeight` are now configurable on the tenant `RetrievalConfig`. Defaults preserve the previous hardcoded behavior (k=60, weights 0.7/0.3). 
+ +### ⚡ Improvements +- **IMPROVED**: Chunker recursive priority — `splitBySeparators` now genuinely walks separators by priority and recursively re-splits oversize sub-pieces with the next-priority separator. Mirrors the Python reference. Without this fix, a "one paragraph break followed by a long run of newline-separated lines" pattern could emit ~1900-rune chunks at chunkSize=300. +- **IMPROVED**: ChunkOverlap default consolidated to 80 (~15% of ChunkSize). Previously the Go DefaultConfig used 64, the knowledge service used 50, the Python docreader used 100, and the frontend form initialised to 100. All paths now align. +- **IMPROVED**: ContextHeader (Markdown breadcrumb) lives on `Chunk.ContextHeader`, separate from `Chunk.Content`. Restores the `End-Start == len(Content)` invariant that the document-reconstruction path in `knowledge.go` relies on for summary generation and UI highlighting. Eliminates a duplicate-heading regression where the section heading appeared twice in a chunk's body. +- **IMPROVED**: Embedding pipeline — exponential backoff (200/400/800/1600/3200 ms) replaces the previous fixed 100ms × 5 retry loop, with context-cancellation between attempts. `sanitizeForEmbedding` caps single embedding inputs at 20k runes with a warning log on overflow. +- **IMPROVED**: SplitParentChild forces children onto the recursive tier, skipping per-parent profile passes (previously paid N extra O(N) document scans). +- **IMPROVED**: Heuristic splitter snaps overlap start to the nearest semantic boundary or newline instead of slicing mid-line / mid-word. +- **IMPROVED**: Validator flow — when every tier is rejected, the chain returns the legacy tier's output directly instead of running SplitText a second time. +- **IMPROVED**: Token limit per chunk — when set, ChunkSize is auto-clamped to a per-language character budget (with a 10% safety factor). Prevents overshooting embedding model token caps on CJK content where 1 char ≈ 0.6 tokens. 
+- **IMPROVED**: KB-config API — `strategy`, `tokenLimit`, `languages` use pointer DTOs server-side so a payload omitting a field means "no change" while an explicit empty / 0 / [] resets to default. Previously these were write-once fields. + +### 🐛 Bug Fixes +- **FIXED**: Chunker — `Chunk.Start` / `End` rune-offset invariant restored after the heading-aware splitter started prepending breadcrumbs to content (regression introduced during the initial Tier-1 work, fixed before any release). +- **FIXED**: Heuristic splitter — `applyOverlap` aligns to boundaries instead of doing blind char-subtraction that could leave chunks starting mid-word in CJK text. +- **FIXED**: Preview endpoint — chunk-size statistics are now computed over the FULL chunk set before truncating the response payload to 500 entries. Previously `avg`/`min`/`max`/`stddev` reflected only the first 500 chunks of a larger split. +- **FIXED**: Preview endpoint — empty / whitespace-only sample text now returns a friendly 400 ("paste a sample…") instead of gin's cryptic `Field validation failed` error. +- **FIXED**: Frontend chunking debug panel — added explicit `type="button"`, prominent loading and error states, and console error logging so failed previews are debuggable from DevTools without enabling verbose logging. Earlier the panel could appear to "vanish" silently when a request failed. +- **FIXED**: KB-editor i18n — `chunkOverlap` initial form value aligned with the backend default (80, not 100); description texts on every chunking setting now state the recommended ranges per use-case. + +### 📚 Documentation +- **DOC**: New `docs/CHUNKING.md` — strategy explanations, settings reference with use-case presets, token-limit guide per embedding model, debugging workflow, and known trade-offs. 
+ ## [0.5.1] - 2026-04-30 ### 🚀 New Features diff --git a/docs/CHUNKING.md b/docs/CHUNKING.md new file mode 100644 index 00000000..8e84ffbc --- /dev/null +++ b/docs/CHUNKING.md @@ -0,0 +1,160 @@ +# Chunking Guide + +How WeKnora splits uploaded documents before embedding, why the defaults +are what they are, and when to change them. + +## Why chunking matters + +Retrieval-Augmented Generation (RAG) works by embedding small slices of +your documents into a vector index, then pulling the most relevant +slices back at query time. The way a document is sliced — chunk size, +overlap, where the cuts fall — directly drives retrieval recall and the +quality of the answers your LLM produces. + +Empirically (Vecta Feb-2026 benchmark across 50 academic papers): +recursive splitting at ~512 tokens with ~15% overlap is the strongest +single-knob baseline at 69% end-to-end accuracy, beating semantic +chunking and over-engineered hybrids. WeKnora uses that as the +foundation and layers smarter strategies on top when the document gives +us structural cues. + +## Adaptive 3-tier chunking + +Set per knowledge base via the editor's **Chunking** sidebar (or the +`strategy` field on the KB-config API). + +| Strategy | When picked | What it does | +|----------|-------------|--------------| +| `auto` (recommended) | Default for new KBs | Profiles the document and picks the strongest tier from the chain below. | +| `heading` | Markdown-style structure | Splits at `#` / `##` / `###` boundaries. Each chunk gets a breadcrumb context header (`# Top > ## Section`) prepended at embedding time. | +| `heuristic` | PDF-style structure | Splits at form-feeds (page breaks), numbered sections, multilingual chapter markers (DE / EN / ZH), all-caps titles, and visual separators. | +| `legacy` (= `recursive`) | Anything else, or as fallback | Pure recursive separator-based splitter — newest version with priority recursion and overlap-cap fixes. 
| + +A document profiler runs first and counts structural signals (Markdown +headings, form-feeds, chapter markers per language, all-caps lines, +visual separators, blank-line bursts). Auto-strategy picks the tier +chain based on those counts; a validator rejects obviously broken +output (e.g. the heading splitter producing 200 single-line chunks) +and falls through to the next tier. + +## Settings reference + +### Core + +| Setting | Range | Default | Sweet spot for… | +|---------|-------|---------|-----------------| +| **Chunk size** | 100–4000 chars | 512 | Default works for most cases. 200–400 for FAQs / atomic Q&A. 1000–2000 for narrative documents. | +| **Chunk overlap** | 0–500 chars | 80 (~15%) | 0 for FAQs and structured records. 80 (default) for general documents. 150–200 for argumentative texts where reasoning crosses chunks. | +| **Separators** | string list | `["\n\n", "\n", "。", "!", "?", ";", ";"]` | Order matters — splitter tries higher-priority separators first and only falls back to lower ones when a piece is still oversize. | + +### Parent-Child chunking + +Two-level retrieval: **child chunks** (small, embedded for vector match) +and **parent chunks** (larger, returned to the LLM for context). + +| Setting | Range | Default | Notes | +|---------|-------|---------|-------| +| **Enable parent-child** | toggle | on | Recommended for documents > 10 pages. Skip for short FAQs to halve storage cost. | +| **Parent chunk size** | 512–8192 chars | 4096 (~1000 EN tokens) | Larger for long-context LLMs (Claude, GPT-4-Turbo). Smaller (1024–2048) for local LLMs with 4k contexts. | +| **Child chunk size** | 64–2048 chars | 384 (~80 EN tokens) | 128–256 for Q&A-style precise matching. 512–1024 if your embedder accepts at least 512 tokens (E5 / BGE-large). | + +### Advanced + +| Setting | Range | Default | When to set | +|---------|-------|---------|-------------| +| **Token limit** | 0–8192 | 0 (off) | Activate when your embedding model has a small token cap. 
See table below. | +| **Languages** | `de` / `en` / `zh` (multi-select) | empty (auto-detect) | Set explicitly for homogeneous corpora to narrow heuristic patterns. | + +#### Token-limit guide per embedding model + +| Embedder | Token limit | Recommended `tokenLimit` setting | +|----------|-------------|----------------------------------| +| OpenAI `text-embedding-3-small/large` | 8191 | **0 (leave off)** | +| Voyage AI `voyage-3` | 32000 | **0** | +| Jina-embeddings-v3 | 8192 | **0** | +| Cohere `embed-multilingual-v3` | 512 | **400** | +| BGE-base / BGE-large / E5-large | 512 | **400** | +| Sentence-Transformer `all-MiniLM-L6-v2` | 256 | **200** | + +Rule of thumb: leave at 0 for any modern embedder that accepts more than 2000 tokens. +Otherwise, set it to about 80% of the model's hard limit so +chunks always fit, even for CJK content (which uses more tokens per character). + +## Use-case presets + +| Workload | Strategy | ChunkSize | Overlap | Parent-Child | +|----------|----------|-----------|---------|--------------| +| FAQ / Q&A knowledge base | `auto` (likely picks legacy) | 200–400 | 0 | off | +| Markdown documentation / wikis | `auto` (picks heading) | 512 | 80 | on | +| PDF reports with page breaks | `auto` (picks heuristic) | 800–1200 | 100–150 | on | +| Long-form narrative (books, articles) | `auto` (picks recursive) | 1000–2000 | 150–200 | on | +| Code documentation | `legacy` | 800 | 100 | optional | +| Mixed-language corpus | `auto`, languages = empty | 512 | 80 | on | +| Tabular reports / CSV-derived | `legacy` | 400 | 0 | off | + +## Debugging in the UI + +The KB editor's **Chunking** sidebar has a "Test with sample text" +collapsible at the bottom: + +1. Paste a Markdown / plain-text snippet (max 64k characters). +2. Click **Run preview**. +3. The panel shows: + - Selected strategy tier as a colored tag + - Tiers that were rejected and why (e.g. 
"too many tiny chunks") + - Document profile (heading counts, form-feeds, chapter markers, + detected languages) + - Size statistics over the full chunk set (avg / min / max / stddev) + - Per-chunk cards with size in chars + approximate tokens, position + range, the section breadcrumb (when set), and a content preview + +This runs read-only against a goroutine-isolated splitter pass (5s +timeout) — no DB writes, no embedding API calls. Use it to compare +configurations against the same sample before triggering a re-upload. + +## API + +```http +PUT /api/v1/initialization/config/:kbId +Authorization: Bearer +Content-Type: application/json + +{ + "documentSplitting": { + "chunkSize": 512, + "chunkOverlap": 80, + "separators": ["\n\n", "\n", "。", "!", "?", ";", ";"], + "strategy": "auto", + "tokenLimit": 0, + "languages": ["de", "en"], + "enableParentChild": true, + "parentChunkSize": 4096, + "childChunkSize": 384 + } +} +``` + +The `strategy`, `tokenLimit`, and `languages` fields use pointer-based +DTOs server-side: omitting them in the payload means "no change", +sending an empty string / 0 / [] explicitly resets to default. + +The preview endpoint accepts the same payload shape under +`POST /api/v1/chunker/preview` with an additional `text` field. + +## Known trade-offs + +- **Tier-1 heading-aware chunking** prepends the section breadcrumb to + the embedding input, costing ~5% more tokens per chunk in exchange + for ~30–50% fewer chunks on structured documents (net token savings + on storage and at query time). +- **Strategy switches do not auto-reindex** existing documents. After + changing a KB's strategy, re-upload affected files (or trigger + re-indexing via the UI) to apply the new chunking. +- **OCR artifacts in PDFs** (vertical layout text broken character- + by-character into separate lines) cannot be fixed by any splitter — + this is a parser-side limitation. 
The heuristic tier still keeps + chunks aligned to page boundaries, which mitigates the worst cases. +- **The `recursive` strategy value** exists in the API for completeness + but is intentionally hidden from the UI: it behaves almost identically to + `legacy`, and a fifth dropdown option would dilute the meaningful + choice between automatic / Markdown / heuristic / legacy. diff --git a/frontend/src/i18n/locales/en-US.ts b/frontend/src/i18n/locales/en-US.ts index 001db182..9f833f8d 100755 --- a/frontend/src/i18n/locales/en-US.ts +++ b/frontend/src/i18n/locales/en-US.ts @@ -1873,14 +1873,14 @@ export default { }, chunking: { title: 'Chunking Settings', - description: 'Configure document chunking parameters to improve retrieval quality', + description: 'Controls how uploaded documents are split before embedding. Defaults work for most cases — tune only when retrieval quality is off.', sizeLabel: 'Chunk Size', - sizeDescription: 'Controls the number of characters in each chunk (100-4000)', + sizeDescription: 'Maximum characters per chunk (100–4000). Default 512 ≈ 100–130 English tokens. Smaller for FAQs (200–400), larger for narrative documents (1000–2000).', characters: 'characters', overlapLabel: 'Chunk Overlap', - overlapDescription: 'Number of overlapping characters between adjacent chunks (0-500)', + overlapDescription: 'Characters shared between adjacent chunks (0–500). Default 80 ≈ 15% of size — sweet spot per current research. Use 0 for FAQs/structured data, 150–200 for long-form narratives.', separatorsLabel: 'Separators', - separatorsDescription: 'Separators used when chunking documents', + separatorsDescription: 'Characters or strings the splitter prefers when cutting. 
Higher-priority separators are tried first; the default order favors paragraph → sentence → punctuation breaks.', separatorsPlaceholder: 'Select or customize separators', separators: { doubleNewline: 'Double newline (\ @@ -1896,11 +1896,11 @@ export default { space: 'Space ( )' }, parentChildLabel: 'Parent-Child Chunking', - parentChildDescription: 'Enable two-level parent-child chunking strategy. Large parent chunks provide context while small child chunks are used for vector matching.', + parentChildDescription: 'Two-level chunking: small child chunks are vector-matched (precise hits) but the larger parent chunk is returned to the LLM (richer context). Recommended for long documents (>10 pages); skip for short FAQs to save storage.', parentChunkSizeLabel: 'Parent Chunk Size', - parentChunkSizeDescription: 'Size of parent chunks that provide context (256-4096)', + parentChunkSizeDescription: 'Size of the context chunk returned to the LLM (512–8192). Default 4096 ≈ 1000 English tokens, fits comfortably in any modern LLM context window.', childChunkSizeLabel: 'Child Chunk Size', - childChunkSizeDescription: 'Size of child chunks used for embedding matching (64-1024)', + childChunkSizeDescription: 'Size of the embedded chunk used for vector match (64–2048). Default 384 ≈ 80 tokens — sweet spot for sentence-transformer / BGE-style embedders.', strategyLabel: 'Chunking Strategy', strategyDescription: 'Choose how documents are split into chunks. The Automatic mode profiles each document and picks the best strategy.', strategyPlaceholder: 'Select strategy (defaults to classic recursive splitting)', @@ -1925,9 +1925,9 @@ export default { overlapWarning: 'Overlap is large compared to chunk size — chunks will share most of their content.', advancedLabel: 'Advanced options', tokenLimitLabel: 'Token limit per chunk', - tokenLimitDescription: 'Cap chunk size in approximate tokens. When set, chunk size is automatically reduced to stay below this token budget. 
0 = off (use character size only).', + tokenLimitDescription: 'Hard token cap per chunk (0–8192). 0 = off (chunk size in characters only). Activate when your embedding model has a small token limit: 200 for MiniLM (256 tok), 400 for BGE/Cohere (512 tok). Modern embedders (OpenAI, Voyage, Jina-v3) accept >2000 tokens — leave at 0.', languagesLabel: 'Language hints', - languagesDescription: 'Hint which languages structure-detection patterns should look for. Leave empty for auto-detection.', + languagesDescription: 'Restricts heuristic patterns to the chosen languages (DE/EN/ZH). Empty = auto-detect from sample. Set explicitly for homogeneous corpora to avoid false-positive matches across languages.', languagesPlaceholder: 'Auto-detect', languageOptions: { de: 'German', diff --git a/frontend/src/i18n/locales/ko-KR.ts b/frontend/src/i18n/locales/ko-KR.ts index 9801f33a..c8d7a212 100755 --- a/frontend/src/i18n/locales/ko-KR.ts +++ b/frontend/src/i18n/locales/ko-KR.ts @@ -2453,14 +2453,14 @@ export default { }, chunking: { title: "청크 설정", - description: "문서 청킹 파라미터를 설정하여 검색 효과 최적화", + description: "업로드된 문서가 임베딩되기 전에 분할되는 방식을 제어합니다. 대부분의 경우 기본값으로 충분합니다.", sizeLabel: "청크 크기", - sizeDescription: "각 문서 청크의 문자 수 제어 (100-4000)", + sizeDescription: "청크당 최대 문자 수 (100-4000). 기본값 512 ≈ 영어 100-130 토큰. FAQ는 200-400, 서술형 문서는 1000-2000.", characters: "문자", overlapLabel: "청크 중복", - overlapDescription: "인접 문서 청크 간의 중복 문자 수 (0-500)", + overlapDescription: "인접 청크 간 공유 문자 수 (0-500). 기본값 80 ≈ 청크 크기의 15% — 현재 연구 권장값. FAQ/구조화 데이터는 0, 긴 서술은 150-200.", separatorsLabel: "구분자", - separatorsDescription: "문서 청킹 시 사용되는 구분자", + separatorsDescription: "분할 시 우선적으로 사용되는 문자/문자열. 우선순위가 높은 구분자를 먼저 시도; 기본 순서는 단락 → 문장 → 구두점.", separatorsPlaceholder: "구분자 선택 또는 사용자 정의", separators: { doubleNewline: "이중 줄바꿈 (\\n\\n)", @@ -2473,11 +2473,11 @@ export default { space: "공백 ( )", }, parentChildLabel: "부모-자식 청킹", - parentChildDescription: "2단계 부모-자식 청킹 전략을 활성화합니다. 
큰 부모 청크는 컨텍스트를 제공하고, 작은 자식 청크는 벡터 매칭에 사용됩니다.", + parentChildDescription: "2단계 청킹: 작은 자식 청크는 벡터 매칭(정확한 히트), 큰 부모 청크는 LLM에 반환(풍부한 컨텍스트). 긴 문서(>10페이지)에 권장; 짧은 FAQ는 비활성화하여 저장 공간 절약.", parentChunkSizeLabel: "부모 청크 크기", - parentChunkSizeDescription: "컨텍스트를 제공하는 부모 청크의 문자 수 (256-4096)", + parentChunkSizeDescription: "LLM에 반환되는 컨텍스트 청크 크기 (512-8192). 기본값 4096 ≈ 1000 영어 토큰, 모든 현대 LLM 컨텍스트에 적합.", childChunkSizeLabel: "자식 청크 크기", - childChunkSizeDescription: "임베딩 매칭에 사용되는 자식 청크의 문자 수 (64-1024)", + childChunkSizeDescription: "벡터 매칭에 사용되는 임베딩 청크 크기 (64-2048). 기본값 384 ≈ 80 토큰 — sentence-transformer / BGE 임베더의 최적점.", strategyLabel: "청킹 전략", strategyDescription: "문서를 청크로 분할하는 방법을 선택합니다. 자동 모드는 문서를 프로파일링하여 최적의 전략을 선택합니다.", strategyPlaceholder: "전략 선택 (기본: 클래식 재귀 분할)", @@ -2502,9 +2502,9 @@ export default { overlapWarning: "오버랩이 청크 크기에 비해 큽니다 — 청크가 대부분의 콘텐츠를 공유합니다.", advancedLabel: "고급 옵션", tokenLimitLabel: "청크당 토큰 제한", - tokenLimitDescription: "근사 토큰 수로 청크 크기를 제한합니다. 설정되면 토큰 예산을 유지하기 위해 청크 크기가 자동으로 축소됩니다. 0 = 끄기 (문자 크기만 사용).", + tokenLimitDescription: "청크당 토큰 하드 제한 (0-8192). 0 = 끄기 (문자 크기만). 임베딩 모델의 토큰 제한이 작을 때 활성화: MiniLM (256 tok)은 200, BGE/Cohere (512 tok)는 400. 현대 임베더(OpenAI, Voyage, Jina-v3)는 >2000 토큰 지원 — 0으로 두세요.", languagesLabel: "언어 힌트", - languagesDescription: "구조 감지 패턴이 찾아야 할 언어를 힌트로 제공합니다. 자동 감지를 위해 비워두세요.", + languagesDescription: "휴리스틱 패턴을 선택한 언어(DE/EN/ZH)로만 제한합니다. 비어 있음 = 샘플에서 자동 감지. 동질적인 코퍼스는 명시적으로 설정하여 언어 간 오탐 방지.", languagesPlaceholder: "자동 감지", languageOptions: { de: "독일어", diff --git a/frontend/src/i18n/locales/ru-RU.ts b/frontend/src/i18n/locales/ru-RU.ts index cb617ed9..0509561b 100755 --- a/frontend/src/i18n/locales/ru-RU.ts +++ b/frontend/src/i18n/locales/ru-RU.ts @@ -2088,14 +2088,14 @@ export default { }, chunking: { title: 'Настройки разбиения', - description: 'Настройте параметры разбиения документов для улучшения качества поиска', + description: 'Управляет тем, как загруженные документы разбиваются перед эмбеддингом. 
Значения по умолчанию подходят для большинства случаев — настройте только если качество поиска плохое.', sizeLabel: 'Размер блока', - sizeDescription: 'Определяет количество символов в каждом блоке (100-4000)', + sizeDescription: 'Максимальное количество символов в блоке (100–4000). По умолчанию 512 ≈ 100–130 английских токенов. Меньше для FAQ (200–400), больше для повествовательных документов (1000–2000).', characters: 'символов', overlapLabel: 'Перекрытие блоков', - overlapDescription: 'Количество перекрывающихся символов между соседними блоками (0-500)', + overlapDescription: 'Количество символов, общих для соседних блоков (0–500). По умолчанию 80 ≈ 15% размера — оптимум по текущим исследованиям. 0 для FAQ/структурированных данных, 150–200 для длинных повествований.', separatorsLabel: 'Разделители', - separatorsDescription: 'Разделители, используемые при разбиении документов', + separatorsDescription: 'Символы или строки, которые сплиттер предпочитает при резке. Разделители более высокого приоритета пробуются первыми; порядок по умолчанию: абзацы → предложения → пунктуация.', separatorsPlaceholder: 'Выберите или настройте разделители', separators: { doubleNewline: 'Двойной перевод строки (\\n\\n)', @@ -2108,11 +2108,11 @@ export default { space: 'Пробел ( )' }, parentChildLabel: 'Родительско-дочернее разбиение', - parentChildDescription: 'Включить двухуровневую стратегию разбиения. Большие родительские блоки обеспечивают контекст, а маленькие дочерние блоки используются для векторного поиска.', + parentChildDescription: 'Двухуровневое разбиение: маленькие дочерние блоки используются для векторного матчинга (точные совпадения), большой родительский блок возвращается LLM (более богатый контекст). 
Рекомендуется для длинных документов (>10 страниц); пропустите для коротких FAQ для экономии хранилища.', parentChunkSizeLabel: 'Размер родительского блока', - parentChunkSizeDescription: 'Размер родительских блоков для контекста (256-4096)', + parentChunkSizeDescription: 'Размер контекстного блока, возвращаемого LLM (512–8192). По умолчанию 4096 ≈ 1000 английских токенов, комфортно вписывается в любое современное контекстное окно.', childChunkSizeLabel: 'Размер дочернего блока', - childChunkSizeDescription: 'Размер дочерних блоков для поиска по эмбеддингам (64-1024)', + childChunkSizeDescription: 'Размер встроенного блока для векторного матчинга (64–2048). По умолчанию 384 ≈ 80 токенов — оптимум для эмбеддеров уровня sentence-transformer / BGE.', strategyLabel: 'Стратегия разбиения', strategyDescription: 'Выберите способ разбиения документов на блоки. Автоматический режим анализирует каждый документ и выбирает оптимальную стратегию.', strategyPlaceholder: 'Выберите стратегию (по умолчанию классическое рекурсивное разбиение)', @@ -2137,9 +2137,9 @@ export default { overlapWarning: 'Перекрытие велико по сравнению с размером блока — блоки будут содержать большую часть одинакового контента.', advancedLabel: 'Расширенные параметры', tokenLimitLabel: 'Лимит токенов на блок', - tokenLimitDescription: 'Ограничьте размер блока приблизительным количеством токенов. При установке размер блока автоматически уменьшается, чтобы оставаться в пределах лимита. 0 = выключено (только символьный размер).', + tokenLimitDescription: 'Жёсткий лимит токенов на блок (0–8192). 0 = выкл (только символы). Активируйте, когда у вашего эмбеддера небольшой токен-лимит: 200 для MiniLM (256 ток), 400 для BGE/Cohere (512 ток). Современные эмбеддеры (OpenAI, Voyage, Jina-v3) поддерживают >2000 токенов — оставьте 0.', languagesLabel: 'Языковые подсказки', - languagesDescription: 'Подсказка о том, какие языки должны искать паттерны определения структуры. 
Оставьте пустым для автоопределения.', + languagesDescription: 'Ограничивает эвристические паттерны выбранными языками (DE/EN/ZH). Пусто = автоопределение из образца. Установите явно для однородных корпусов, чтобы избежать ложных срабатываний между языками.', languagesPlaceholder: 'Автоопределение', languageOptions: { de: 'Немецкий', diff --git a/frontend/src/i18n/locales/zh-CN.ts b/frontend/src/i18n/locales/zh-CN.ts index be5fd979..ff69ea4e 100755 --- a/frontend/src/i18n/locales/zh-CN.ts +++ b/frontend/src/i18n/locales/zh-CN.ts @@ -2412,14 +2412,14 @@ export default { }, chunking: { title: "分块设置", - description: "配置文档分块参数,优化检索效果", + description: "控制上传文档在嵌入前的切分方式。默认值适用于大多数场景,仅在检索质量异常时调整。", sizeLabel: "分块大小", - sizeDescription: "控制每个文档分块的字符数(100-4000)", + sizeDescription: "每个分块的最大字符数(100-4000)。默认 512 ≈ 中文 300 tokens / 英文 100-130 tokens。FAQ 用 200-400,叙述性长文档用 1000-2000。", characters: "字符", overlapLabel: "分块重叠", - overlapDescription: "相邻文档块之间的重叠字符数(0-500)", + overlapDescription: "相邻分块之间共享的字符数(0-500)。默认 80 ≈ 分块大小的 15%,符合当前研究推荐。FAQ/结构化数据用 0,长篇叙述用 150-200。", separatorsLabel: "分隔符", - separatorsDescription: "文档分块时使用的分隔符", + separatorsDescription: "切分时优先使用的字符或字符串。优先级高的分隔符先尝试;默认顺序优先段落 → 句子 → 标点。", separatorsPlaceholder: "选择或自定义分隔符", separators: { doubleNewline: "双换行 (\\n\\n)", @@ -2432,11 +2432,11 @@ export default { space: "空格 ( )", }, parentChildLabel: "父子分块", - parentChildDescription: "启用两级父子分块策略。大的父块提供上下文,小的子块用于向量匹配检索。", + parentChildDescription: "两级分块:小的子块用于向量匹配(精准命中),大的父块返回给 LLM(更丰富上下文)。建议用于长文档(>10 页);短 FAQ 可关闭以节省存储。", parentChunkSizeLabel: "父块大小", - parentChunkSizeDescription: "提供上下文的父块字符数(256-4096)", + parentChunkSizeDescription: "返回给 LLM 的上下文块大小(512-8192)。默认 4096 ≈ 1000 英文 tokens,适合所有现代 LLM 上下文窗口。", childChunkSizeLabel: "子块大小", - childChunkSizeDescription: "用于向量匹配的子块字符数(64-1024)", + childChunkSizeDescription: "用于向量匹配的嵌入块大小(64-2048)。默认 384 ≈ 80 tokens,是 sentence-transformer / BGE 类嵌入模型的最佳点。", strategyLabel: "分块策略", strategyDescription: 
"选择文档的分块方式。自动模式会分析每个文档的结构并选择最佳策略。", strategyPlaceholder: "选择策略(默认使用经典递归分块)", @@ -2461,9 +2461,9 @@ export default { overlapWarning: "重叠相对于分块大小较大——分块之间会共享大部分内容。", advancedLabel: "高级选项", tokenLimitLabel: "每块 Token 上限", - tokenLimitDescription: "按近似 Token 数限制分块大小。设置后会自动缩小分块以保持在 Token 预算内。0 = 关闭(仅按字符数)。", + tokenLimitDescription: "每个分块的硬性 Token 上限(0-8192)。0 = 关闭(仅按字符数)。当嵌入模型 Token 上限较小时启用:MiniLM (256 tok) 用 200,BGE/Cohere (512 tok) 用 400。现代嵌入器(OpenAI、Voyage、Jina-v3)支持 >2000 tokens,保持 0 即可。", languagesLabel: "语言提示", - languagesDescription: "提示结构检测模式应识别哪些语言。留空则自动检测。", + languagesDescription: "限制启发式模式只识别选定的语言(DE/EN/ZH)。留空 = 自动检测。同质化语料库可显式设置以避免跨语言误匹配。", languagesPlaceholder: "自动检测", languageOptions: { de: "德语", diff --git a/frontend/src/views/knowledge/KnowledgeBaseEditorModal.vue b/frontend/src/views/knowledge/KnowledgeBaseEditorModal.vue index 568984b6..6e705e79 100644 --- a/frontend/src/views/knowledge/KnowledgeBaseEditorModal.vue +++ b/frontend/src/views/knowledge/KnowledgeBaseEditorModal.vue @@ -409,10 +409,13 @@ const WIKI_ONLY_CHUNKING_PRESET = { enableParentChild: false, } as const -// 非 Wiki-only 场景下回落到的默认值(与 initFormData 保持一致)。 +// Non-Wiki-only fallback. Mirrors chunker.DefaultChunkSize and +// DefaultChunkOverlap on the backend so a freshly created KB uses +// the same numbers whether the editor sets them or the splitter +// falls back to its package defaults. const DEFAULT_CHUNKING_PRESET = { chunkSize: 512, - chunkOverlap: 100, + chunkOverlap: 80, enableParentChild: true, } as const @@ -485,7 +488,9 @@ const initFormData = (type: 'document' | 'faq' = 'document') => { }, chunkingConfig: { chunkSize: 512, - chunkOverlap: 100, + // 80 ≈ 15% of chunkSize — community-recommended sweet spot. + // Aligned with chunker.DefaultChunkOverlap on the backend. 
+ chunkOverlap: 80, separators: ['\n\n', '\n', '。', '!', '?', ';', ';'], parserEngineRules: undefined as any, enableParentChild: true, @@ -586,7 +591,9 @@ const loadKBData = async () => { }, chunkingConfig: { chunkSize: kb.chunking_config?.chunk_size || 512, - chunkOverlap: kb.chunking_config?.chunk_overlap || 100, + // Fallback when the loaded KB has no chunk_overlap stored; `||` also replaces a stored 0. + // Aligned with chunker.DefaultChunkOverlap on the backend. + chunkOverlap: kb.chunking_config?.chunk_overlap || 80, separators: kb.chunking_config?.separators || ['\n\n', '\n', '。', '!', '?', ';', ';'], parserEngineRules: kb.chunking_config?.parser_engine_rules || undefined, enableParentChild: kb.chunking_config?.enable_parent_child || false, diff --git a/frontend/src/views/knowledge/settings/KBChunkingSettings.vue b/frontend/src/views/knowledge/settings/KBChunkingSettings.vue index 4f25307f..581a386e 100644 --- a/frontend/src/views/knowledge/settings/KBChunkingSettings.vue +++ b/frontend/src/views/knowledge/settings/KBChunkingSettings.vue @@ -217,6 +217,19 @@ interface ParserEngineRule { engine: string } +// Slider ranges defined in this file (min/max props on t-slider) mirror +// the validated bounds in the backend splitter: +// ChunkSize: 100–4000 (default 512). Below 100 is too fragmented to be +// useful; 4000 approaches the 7500-char absoluteMaxSize +// that the splitter hard-caps to anyway. +// ChunkOverlap: 0–500 (default 80). Backend caps to ChunkSize/2 +// when set higher than that. +// ParentChunkSize: 512–8192 (default 4096 ≈ 1000 EN tokens). +// ChildChunkSize: 64–2048 (default 384 ≈ 80 EN tokens, sweet spot for +// sentence-transformer / BGE embedders). +// TokenLimit: 0–8192 (default 0 = off, char-based budget only). +// Set to 200 for MiniLM (256-tok limit), 400 for BGE/ +// Cohere (512-tok), leave at 0 for OpenAI/Voyage/Jina-v3. 
interface ChunkingConfig { chunkSize: number chunkOverlap: number @@ -225,11 +238,11 @@ interface ChunkingConfig { enableParentChild: boolean parentChunkSize: number childChunkSize: number - // New: adaptive chunking strategy. Empty string = legacy / not set. + // Adaptive chunking strategy. Empty string = legacy / not set. strategy?: string - // New: cap chunk size in approx tokens. 0 = char-based budget only. + // Cap chunk size in approx tokens. 0 = char-based budget only. tokenLimit?: number - // New: language hints for heuristic patterns (de/en/zh). + // Language hints for heuristic patterns (de/en/zh). languages?: string[] } diff --git a/internal/infrastructure/chunker/splitter.go b/internal/infrastructure/chunker/splitter.go index 1cf51c5b..a3baf0ee 100644 --- a/internal/infrastructure/chunker/splitter.go +++ b/internal/infrastructure/chunker/splitter.go @@ -66,19 +66,33 @@ type SplitterConfig struct { Languages []string } -// Default sizes used by all entry points (DefaultConfig, ensureDefaults, -// and buildSplitterConfig in the knowledge service). +// Default chunk sizing constants. Single source of truth for the entire +// chunker package and (via knowledge.go::buildSplitterConfig) the +// knowledge service. The frontend KnowledgeBaseEditorModal mirrors these +// numbers in its initial form state — keep them in sync if you change +// either value here. +// +// DefaultChunkSize = 512 chars: ~100–130 English tokens / ~300 Chinese +// tokens. Validated as a strong baseline by the Vecta Feb-2026 benchmark +// across 50 academic papers. Use 200–400 for FAQ-style atomic content, +// 1000–2000 for narrative / argumentative documents. +// +// DefaultChunkOverlap = 80 chars (≈15% of DefaultChunkSize): community- +// recommended sweet spot between recall (an answer split across a +// boundary needs overlap to be retrievable) and storage cost. Use 0 for +// strictly atomic data (FAQ, JSON records), 150–200 for long narratives +// where reasoning crosses chunks. 
// // MIGRATION NOTE: Prior versions had three different overlap defaults // (Go DefaultConfig: 64, knowledge.go buildSplitterConfig: 50, Python -// docreader: 100). This file is now the single source of truth at 80 -// (≈15% of DefaultChunkSize) — a community-recommended sweet spot. +// docreader: 100). All consolidated to 80 here. // -// Existing knowledge bases that stored ChunkOverlap=0 in the DB will pick -// up 80 on next re-index; their previously-indexed embeddings will not -// match new ones bit-for-bit. Recall stays similar but search ranking -// can shift slightly. To freeze the old behavior on a per-KB basis, -// explicitly set ChunkingConfig.ChunkOverlap to 64 before re-indexing. +// Existing knowledge bases that stored ChunkOverlap=0 in the DB pick +// this 80 up on next re-index; their previously-indexed embeddings will +// not match new ones bit-for-bit. Recall stays similar but search +// ranking can shift slightly. To freeze the old behavior on a per-KB +// basis, explicitly set ChunkingConfig.ChunkOverlap to 64 before +// re-indexing. const ( DefaultChunkSize = 512 DefaultChunkOverlap = 80