mirror of
https://github.com/Tencent/WeKnora.git
synced 2026-06-04 13:30:32 +08:00
docs(chunking): align overlap default + sharpen UI/source/repo docs
Three documentation passes around the adaptive chunking work:

UI
- Frontend ChunkOverlap default consolidated to 80 (was 100), matching chunker.DefaultChunkOverlap on the backend. Both DEFAULT_CHUNKING_PRESET and initFormData updated. The KB-load fallback also uses 80 when a loaded KB has no chunk_overlap stored.
- All four locales (en-US, zh-CN, ko-KR, ru-RU) get rewritten chunking setting descriptions: each now states the validated range, the default, and the situations where you'd deviate (FAQ vs narrative, embedder token limits, language-specific corpora).

Source code
- splitter.go: DefaultChunkSize / DefaultChunkOverlap constants get a longer block-comment explaining the per-language token math and the use-case sweet spots, plus the migration note on what the old inconsistent defaults were.
- KBChunkingSettings.vue: new comment block above ChunkingConfig documents the slider min/max for each setting, why those bounds exist, and the recommended TokenLimit values per embedding model.

Repo docs
- New docs/CHUNKING.md: end-to-end guide covering why chunking matters, the adaptive 3-tier architecture, per-setting reference with ranges and sweet spots, parent-child explanation, the token-limit table per embedder (OpenAI / Voyage / Cohere / BGE / MiniLM / Jina), 7 use-case presets, the debug panel workflow, the API surface, and known trade-offs (recursive strategy hidden from UI, no auto-reindex on strategy switch, OCR limitations).
- CHANGELOG.md gets a new [Unreleased] section consolidating all the adaptive-chunking work shipped on this branch: 5 features, 8 improvements, 6 fixes, 1 docs entry. The entry references docs/CHUNKING.md for deeper explanation.

https://claude.ai/code/session_01XADhx6mtu2ZYW3DE9Lun6k
31
CHANGELOG.md
@@ -2,6 +2,37 @@
All notable changes to this project will be documented in this file.

## [Unreleased]

### 🚀 New Features
- **NEW**: Adaptive 3-tier chunking: documents are now profiled before splitting and routed to one of three strategies: heading-aware (Markdown structure), heuristic (form-feeds, multilingual chapter markers DE/EN/ZH, all-caps titles, visual separators), or recursive (the modernized legacy splitter as a fallback). Auto-strategy is the new default for fresh KBs; existing KBs keep their previous behavior until the user opts in. See `docs/CHUNKING.md`.
- **NEW**: KB editor: chunking settings panel surfaces the new strategy selector (Automatic / Markdown-optimized / Smart structure detection / Classic) plus advanced options for token limit per chunk and language hints. Sharper inline help text on every setting explains when defaults apply and when to tune.
- **NEW**: Chunking debug panel: embedded "Test with sample text" panel under the chunking settings. Paste a snippet, hit Run preview, see selected tier, rejected tiers + reasons, document profile, size distribution stats over the full chunk set, and per-chunk cards with breadcrumb + content preview. Read-only, no DB or embedding side effects, 5-second server-side timeout.
- **NEW**: `POST /api/v1/chunker/preview` endpoint backing the debug panel. Returns `selected_tier`, `tier_chain`, `rejected[]`, `profile`, `chunks[]`, and `stats`. Capped at 64k input runes / 500 chunks per response.
- **NEW**: Per-tenant RRF (Reciprocal Rank Fusion) tuning: `RRFK`, `RRFVectorWeight`, `RRFKeywordWeight` are now configurable on the tenant `RetrievalConfig`. Defaults preserve the previous hardcoded behavior (k=60, weights 0.7/0.3).

### ⚡ Improvements
- **IMPROVED**: Chunker recursive priority: `splitBySeparators` now genuinely walks separators by priority and recursively re-splits oversize sub-pieces with the next-priority separator. Mirrors the Python reference. Without this fix, a "one paragraph break followed by a long run of newline-separated lines" pattern could emit ~1900-rune chunks at chunkSize=300.
- **IMPROVED**: ChunkOverlap default consolidated to 80 (~15% of ChunkSize). Previously the Go DefaultConfig used 64, the knowledge service used 50, the Python docreader used 100, and the frontend form initialised to 100. All paths now align.
- **IMPROVED**: ContextHeader (Markdown breadcrumb) lives on `Chunk.ContextHeader`, separate from `Chunk.Content`. Restores the `End-Start == len(Content)` invariant that the document-reconstruction path in `knowledge.go` relies on for summary generation and UI highlighting. Eliminates a duplicate-heading regression where the section heading appeared twice in a chunk's body.
- **IMPROVED**: Embedding pipeline: exponential backoff (200/400/800/1600/3200 ms) replaces the previous fixed 100ms × 5 retry loop, with context-cancellation between attempts. `sanitizeForEmbedding` caps single embedding inputs at 20k runes with a warning log on overflow.
- **IMPROVED**: SplitParentChild forces children onto the recursive tier, skipping per-parent profile passes (previously paid N extra O(N) document scans).
- **IMPROVED**: Heuristic splitter snaps overlap start to the nearest semantic boundary or newline instead of slicing mid-line / mid-word.
- **IMPROVED**: Validator flow: when every tier is rejected, the chain returns the legacy tier's output directly instead of running SplitText a second time.
- **IMPROVED**: Token limit per chunk: when set, ChunkSize is auto-clamped to a per-language character budget (with a 10% safety factor). Prevents overshooting embedding model token caps on CJK content where 1 char ≈ 0.6 tokens.
- **IMPROVED**: KB-config API: `strategy`, `tokenLimit`, `languages` use pointer DTOs server-side so a payload omitting a field means "no change" while an explicit empty / 0 / [] resets to default. Previously these were write-once fields.

### 🐛 Bug Fixes
- **FIXED**: Chunker: `Chunk.Start` / `End` rune-offset invariant restored after the heading-aware splitter started prepending breadcrumbs to content (regression introduced during the initial Tier-1 work, fixed before any release).
- **FIXED**: Heuristic splitter: `applyOverlap` aligns to boundaries instead of doing blind char-subtraction that could leave chunks starting mid-word in CJK text.
- **FIXED**: Preview endpoint: chunk-size statistics are now computed over the FULL chunk set before truncating the response payload to 500 entries. Previously `avg`/`min`/`max`/`stddev` reflected only the first 500 chunks of a larger split.
- **FIXED**: Preview endpoint: empty / whitespace-only sample text now returns a friendly 400 ("paste a sample…") instead of gin's cryptic `Field validation failed` error.
- **FIXED**: Frontend chunking debug panel: added explicit `type="button"`, prominent loading and error states, and console error logging so failed previews are debuggable from DevTools without enabling verbose logging. Earlier the panel could appear to "vanish" silently when a request failed.
- **FIXED**: KB-editor i18n: `chunkOverlap` initial form value aligned with the backend default (80, not 100); description texts on every chunking setting now state the recommended ranges per use-case.

### 📚 Documentation
- **DOC**: New `docs/CHUNKING.md`: strategy explanations, settings reference with use-case presets, token-limit guide per embedding model, debugging workflow, and known trade-offs.

## [0.5.1] - 2026-04-30

### 🚀 New Features
160
docs/CHUNKING.md
Normal file
@@ -0,0 +1,160 @@
# Chunking Guide

How WeKnora splits uploaded documents before embedding, why the defaults
are what they are, and when to change them.

## Why chunking matters

Retrieval-Augmented Generation (RAG) works by embedding small slices of
your documents into a vector index, then pulling the most relevant
slices back at query time. The way a document is sliced (chunk size,
overlap, where the cuts fall) directly drives retrieval recall and the
quality of the answers your LLM produces.

Empirically (Vecta Feb-2026 benchmark across 50 academic papers),
recursive splitting at ~512 tokens with ~15% overlap is the strongest
single-knob baseline at 69% end-to-end accuracy, beating semantic
chunking and over-engineered hybrids. WeKnora uses that as the
foundation and layers smarter strategies on top when the document gives
us structural cues.

## Adaptive 3-tier chunking

Set per knowledge base via the editor's **Chunking** sidebar (or the
`strategy` field on the KB-config API).

| Strategy | When picked | What it does |
|----------|-------------|--------------|
| `auto` (recommended) | Default for new KBs | Profiles the document and picks the strongest tier from the chain below. |
| `heading` | Markdown-style structure | Splits at `#` / `##` / `###` boundaries. Each chunk gets a breadcrumb context header (`# Top > ## Section`) prepended at embedding time. |
| `heuristic` | PDF-style structure | Splits at form-feeds (page breaks), numbered sections, multilingual chapter markers (DE / EN / ZH), all-caps titles, and visual separators. |
| `legacy` (= `recursive`) | Anything else, or as fallback | Pure recursive separator-based splitter: newest version with priority recursion and overlap-cap fixes. |

A document profiler runs first and counts structural signals (Markdown
headings, form-feeds, chapter markers per language, all-caps lines,
visual separators, blank-line bursts). Auto-strategy picks the tier
chain based on those counts; a validator rejects obviously broken
output (e.g. the heading splitter producing 200 single-line chunks)
and falls through to the next tier.

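The validator fall-through described above can be sketched in Go as follows. This is a minimal illustration, not WeKnora's implementation: the `Tier` type, the `validate` rule (reject when more than half of the chunks are shorter than a threshold), and the two toy splitters are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// Tier is a hypothetical named splitting strategy.
type Tier struct {
	Name  string
	Split func(text string) []string
}

// validate is an assumed sanity rule: reject empty output, or output in
// which more than half of the chunks are shorter than minRunes.
func validate(chunks []string, minRunes int) (ok bool, reason string) {
	if len(chunks) == 0 {
		return false, "no chunks produced"
	}
	tiny := 0
	for _, c := range chunks {
		if utf8.RuneCountInString(c) < minRunes {
			tiny++
		}
	}
	if tiny*2 > len(chunks) {
		return false, "too many tiny chunks"
	}
	return true, ""
}

// runChain tries the tiers in order and returns the first output that
// passes validation. If every tier is rejected, the last tier's output
// is returned as-is rather than being recomputed.
func runChain(text string, chain []Tier, minRunes int) (string, []string, map[string]string) {
	rejected := map[string]string{}
	var name string
	var chunks []string
	for _, t := range chain {
		name, chunks = t.Name, t.Split(text)
		ok, reason := validate(chunks, minRunes)
		if ok {
			break
		}
		rejected[t.Name] = reason
	}
	return name, chunks, rejected
}

// splitAtHeadings is a toy heading splitter: a new chunk starts at every
// line beginning with '#'.
func splitAtHeadings(text string) []string {
	var chunks, cur []string
	for _, line := range strings.Split(text, "\n") {
		if strings.HasPrefix(line, "#") && len(cur) > 0 {
			chunks = append(chunks, strings.Join(cur, "\n"))
			cur = nil
		}
		cur = append(cur, line)
	}
	if len(cur) > 0 {
		chunks = append(chunks, strings.Join(cur, "\n"))
	}
	return chunks
}

// splitAtBlankLines is a toy stand-in for the recursive fallback tier.
func splitAtBlankLines(text string) []string {
	var chunks []string
	for _, p := range strings.Split(text, "\n\n") {
		if s := strings.TrimSpace(p); s != "" {
			chunks = append(chunks, s)
		}
	}
	return chunks
}

func main() {
	doc := "# A\n# B\n# C\nBody text that only appears under the final heading."
	chain := []Tier{
		{Name: "heading", Split: splitAtHeadings},
		{Name: "recursive", Split: splitAtBlankLines},
	}
	tier, chunks, rejected := runChain(doc, chain, 20)
	fmt.Println(tier, len(chunks), rejected)
}
```

In this sample the heading tier yields two one-line chunks out of three, so the assumed validator rejects it and the chain settles on the fallback tier.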
## Settings reference

### Core

| Setting | Range | Default | Sweet spot for… |
|---------|-------|---------|-----------------|
| **Chunk size** | 100–4000 chars | 512 | Default works for most cases. 200–400 for FAQs / atomic Q&A. 1000–2000 for narrative documents. |
| **Chunk overlap** | 0–500 chars | 80 (~15%) | 0 for FAQs and structured records. 80 (default) for general documents. 150–200 for argumentative texts where reasoning crosses chunks. |
| **Separators** | string list | `["\n\n", "\n", "。", "!", "?", ";", ";"]` | Order matters: the splitter tries higher-priority separators first and only falls back to lower ones when a piece is still oversize. |

### Parent-Child chunking

Two-level retrieval: **child chunks** (small, embedded for vector match)
and **parent chunks** (larger, returned to the LLM for context).

| Setting | Range | Default | Notes |
|---------|-------|---------|-------|
| **Enable parent-child** | toggle | on | Recommended for documents > 10 pages. Skip for short FAQs to halve storage cost. |
| **Parent chunk size** | 512–8192 chars | 4096 (~1000 EN tokens) | Larger for long-context LLMs (Claude, GPT-4-Turbo). Smaller (1024–2048) for local LLMs with 4k contexts. |
| **Child chunk size** | 64–2048 chars | 384 (~80 EN tokens) | 128–256 for Q&A-style precise matching. 512–1024 for embedders that handle longer passages well (E5 / BGE-large; their 512-token cap still fits about 1024 English characters). |

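The two-level relationship above can be pictured with a small Go sketch. It uses fixed-size rune windows purely for illustration; the real splitter works on separators, and the `Child` type and `splitParentChild` signature here are hypothetical.

```go
package main

import "fmt"

// Child is a hypothetical child chunk that remembers which parent it
// was cut from.
type Child struct {
	Text   string
	Parent int // index into the parents slice
}

// windows cuts r into consecutive slices of at most size runes.
func windows(r []rune, size int) [][]rune {
	if size < 1 {
		size = 1
	}
	var out [][]rune
	for i := 0; i < len(r); i += size {
		end := i + size
		if end > len(r) {
			end = len(r)
		}
		out = append(out, r[i:end])
	}
	return out
}

// splitParentChild cuts text into parents, then cuts each parent into
// children. Children are what would be embedded; at query time a matched
// child is swapped for parents[child.Parent] before prompting the LLM.
func splitParentChild(text string, parentSize, childSize int) ([]string, []Child) {
	var parents []string
	var children []Child
	for i, p := range windows([]rune(text), parentSize) {
		parents = append(parents, string(p))
		for _, c := range windows(p, childSize) {
			children = append(children, Child{Text: string(c), Parent: i})
		}
	}
	return parents, children
}

func main() {
	parents, children := splitParentChild("abcdefghij", 4, 2)
	hit := children[2] // pretend this child was the vector match
	fmt.Printf("matched %q, returning parent %q\n", hit.Text, parents[hit.Parent])
}
```

Storing only the parent index on each child keeps the embedded text small while still letting retrieval hand the larger context to the model.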
### Advanced

| Setting | Range | Default | When to set |
|---------|-------|---------|-------------|
| **Token limit** | 0–8192 | 0 (off) | Activate when your embedding model has a small token cap. See table below. |
| **Languages** | `de` / `en` / `zh` (multi-select) | empty (auto-detect) | Set explicitly for homogeneous corpora to narrow heuristic patterns. |

#### Token-limit guide per embedding model

| Embedder | Model token limit | Recommended `tokenLimit` setting |
|----------|-------------------|----------------------------------|
| OpenAI `text-embedding-3-small/large` | 8191 | **0 (leave off)** |
| Voyage AI `voyage-3` | 32000 | **0** |
| Jina-embeddings-v3 | 8192 | **0** |
| Cohere `embed-multilingual-v3` | 512 | **400** |
| BGE-base / BGE-large / E5-large | 512 | **400** |
| Sentence-Transformer `all-MiniLM-L6-v2` | 256 | **200** |

Rule of thumb: leave at 0 for any modern embedder that accepts more than
2000 tokens. For smaller embedders, set it to about 80% of the model's
hard limit so chunks always fit, even for CJK content (which is denser
per character).

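As a rough illustration of how a token limit can translate into a character budget, the Go sketch below clamps a chunk size using assumed token densities. The formula, the `clampChunkSize` name, and the density table are assumptions for this example: the "zh" figure follows the 1 char ≈ 0.6 tokens estimate, the "en" figure is derived from 512 chars ≈ 100–130 tokens, and unknown languages fall back to the densest case.

```go
package main

import "fmt"

// tokensPer100Chars holds assumed token densities (illustrative, not
// measured): "zh" from 1 char ≈ 0.6 tokens, "en" from 512 chars ≈
// 100–130 tokens.
var tokensPer100Chars = map[string]int{"zh": 60, "en": 25}

// worstCaseDensity is used for unknown languages or when no hint is set.
const worstCaseDensity = 60

// clampChunkSize returns chunkSize reduced, if necessary, to the
// character budget implied by tokenLimit with a 10% safety margin.
// tokenLimit <= 0 disables the clamp.
func clampChunkSize(chunkSize, tokenLimit int, langs []string) int {
	if tokenLimit <= 0 {
		return chunkSize
	}
	density := 0
	for _, l := range langs {
		d, ok := tokensPer100Chars[l]
		if !ok {
			d = worstCaseDensity
		}
		if d > density {
			density = d // budget for the densest language present
		}
	}
	if density == 0 {
		density = worstCaseDensity
	}
	// budget = tokenLimit * 0.9 / (density / 100), in integer arithmetic.
	budget := tokenLimit * 90 / density
	if chunkSize > budget {
		return budget
	}
	return chunkSize
}

func main() {
	fmt.Println(clampChunkSize(1000, 400, []string{"zh"})) // clamped to the zh budget
	fmt.Println(clampChunkSize(2000, 400, []string{"en"})) // clamped to the en budget
	fmt.Println(clampChunkSize(512, 0, []string{"zh"}))    // limit off, unchanged
}
```

Under these assumptions a `tokenLimit` of 400 allows about 600 characters of Chinese text but about 1440 characters of English, which is why the clamp has to be language-aware.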
## Use-case presets

| Workload | Strategy | ChunkSize | Overlap | Parent-Child |
|----------|----------|-----------|---------|--------------|
| FAQ / Q&A knowledge base | `auto` (likely picks legacy) | 200–400 | 0 | off |
| Markdown documentation / wikis | `auto` (picks heading) | 512 | 80 | on |
| PDF reports with page breaks | `auto` (picks heuristic) | 800–1200 | 100–150 | on |
| Long-form narrative (books, articles) | `auto` (picks recursive) | 1000–2000 | 150–200 | on |
| Code documentation | `legacy` | 800 | 100 | optional |
| Mixed-language corpus | `auto`, languages = empty | 512 | 80 | on |
| Tabular reports / CSV-derived | `legacy` | 400 | 0 | off |

## Debugging in the UI

The KB editor's **Chunking** sidebar has a "Test with sample text"
collapsible at the bottom:

1. Paste a Markdown / plain-text snippet (max 64 KB).
2. Click **Run preview**.
3. The panel shows:
   - Selected strategy tier as a colored tag
   - Tiers that were rejected and why (e.g. "too many tiny chunks")
   - Document profile (heading counts, form-feeds, chapter markers,
     detected languages)
   - Size statistics over the full chunk set (avg / min / max / stddev)
   - Per-chunk cards with size in chars + approximate tokens, position
     range, the section breadcrumb (when set), and a content preview

This runs read-only against a goroutine-isolated splitter pass (5s
timeout), with no DB writes and no embedding API calls. Use it to compare
configurations against the same sample before triggering a re-upload.

## API

```http
PUT /api/v1/initialization/config/:kbId
Authorization: Bearer <jwt>
Content-Type: application/json

{
  "documentSplitting": {
    "chunkSize": 512,
    "chunkOverlap": 80,
    "separators": ["\n\n", "\n", "。", "!", "?", ";", ";"],
    "strategy": "auto",
    "tokenLimit": 0,
    "languages": ["de", "en"],
    "enableParentChild": true,
    "parentChunkSize": 4096,
    "childChunkSize": 384
  }
}
```

The `strategy`, `tokenLimit`, and `languages` fields use pointer-based
DTOs server-side: omitting them in the payload means "no change", while
sending an empty string / 0 / `[]` explicitly resets them to the default.

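The "omitted means no change" behaviour of pointer-based DTOs can be demonstrated with Go's standard `encoding/json`. The struct and function names below are hypothetical and only mirror the three JSON field names from the payload; how the server maps an empty value to its default is not shown.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// splittingPatch is a hypothetical update DTO. A nil pointer means the
// field was absent from the payload; a non-nil pointer means the client
// sent a value, even if that value is "", 0, or [].
type splittingPatch struct {
	Strategy   *string   `json:"strategy"`
	TokenLimit *int      `json:"tokenLimit"`
	Languages  *[]string `json:"languages"`
}

// splittingConfig is a hypothetical stored configuration.
type splittingConfig struct {
	Strategy   string
	TokenLimit int
	Languages  []string
}

// applyPatch decodes raw and overwrites only the fields that were
// present in the payload.
func applyPatch(cfg splittingConfig, raw []byte) (splittingConfig, error) {
	var p splittingPatch
	if err := json.Unmarshal(raw, &p); err != nil {
		return cfg, err
	}
	if p.Strategy != nil {
		cfg.Strategy = *p.Strategy
	}
	if p.TokenLimit != nil {
		cfg.TokenLimit = *p.TokenLimit
	}
	if p.Languages != nil {
		cfg.Languages = *p.Languages
	}
	return cfg, nil
}

func main() {
	cfg := splittingConfig{Strategy: "heading", TokenLimit: 400, Languages: []string{"de"}}

	// Only tokenLimit is sent, so strategy and languages are untouched.
	updated, err := applyPatch(cfg, []byte(`{"tokenLimit": 0}`))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", updated)
}
```

With plain (non-pointer) fields, a missing `tokenLimit` and an explicit `0` would decode to the same value, so the server could not tell "leave it alone" from "switch it off".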
The preview endpoint accepts the same payload shape under
`POST /api/v1/chunker/preview` with an additional `text` field.

## Known trade-offs

- **Tier-1 heading-aware chunking** prepends the section breadcrumb to
  the embedding input, costing ~5% more tokens per chunk in exchange
  for ~30–50% fewer chunks on structured documents (net token savings
  on storage and at query time).
- **Strategy switches do not auto-reindex** existing documents. After
  changing a KB's strategy, re-upload affected files (or trigger
  re-indexing via the UI) to apply the new chunking.
- **OCR artifacts in PDFs** (vertical layout text broken
  character-by-character into separate lines) cannot be fixed by any
  splitter; this is a parser-side limitation. The heuristic tier still
  keeps chunks aligned to page boundaries, which mitigates the worst cases.
- **The `recursive` strategy value** exists in the API for completeness
  but is intentionally hidden from the UI: it is functionally close to
  `legacy`, and adding a fifth dropdown option would dilute the meaningful
  choice between automatic / Markdown / heuristic / legacy.
@@ -1873,14 +1873,14 @@ export default {
|
|||||||
},
|
},
|
||||||
chunking: {
|
chunking: {
|
||||||
title: 'Chunking Settings',
|
title: 'Chunking Settings',
|
||||||
description: 'Configure document chunking parameters to improve retrieval quality',
|
description: 'Controls how uploaded documents are split before embedding. Defaults work for most cases — tune only when retrieval quality is off.',
|
||||||
sizeLabel: 'Chunk Size',
|
sizeLabel: 'Chunk Size',
|
||||||
sizeDescription: 'Controls the number of characters in each chunk (100-4000)',
|
sizeDescription: 'Maximum characters per chunk (100–4000). Default 512 ≈ 100–130 English tokens. Smaller for FAQs (200–400), larger for narrative documents (1000–2000).',
|
||||||
characters: 'characters',
|
characters: 'characters',
|
||||||
overlapLabel: 'Chunk Overlap',
|
overlapLabel: 'Chunk Overlap',
|
||||||
overlapDescription: 'Number of overlapping characters between adjacent chunks (0-500)',
|
overlapDescription: 'Characters shared between adjacent chunks (0–500). Default 80 ≈ 15% of size — sweet spot per current research. Use 0 for FAQs/structured data, 150–200 for long-form narratives.',
|
||||||
separatorsLabel: 'Separators',
|
separatorsLabel: 'Separators',
|
||||||
separatorsDescription: 'Separators used when chunking documents',
|
separatorsDescription: 'Characters or strings the splitter prefers when cutting. Higher-priority separators are tried first; the default order favors paragraph → sentence → punctuation breaks.',
|
||||||
separatorsPlaceholder: 'Select or customize separators',
|
separatorsPlaceholder: 'Select or customize separators',
|
||||||
separators: {
|
separators: {
|
||||||
doubleNewline: 'Double newline (\
|
doubleNewline: 'Double newline (\
|
||||||
@@ -1896,11 +1896,11 @@ export default {
|
|||||||
space: 'Space ( )'
|
space: 'Space ( )'
|
||||||
},
|
},
|
||||||
parentChildLabel: 'Parent-Child Chunking',
|
parentChildLabel: 'Parent-Child Chunking',
|
||||||
parentChildDescription: 'Enable two-level parent-child chunking strategy. Large parent chunks provide context while small child chunks are used for vector matching.',
|
parentChildDescription: 'Two-level chunking: small child chunks are vector-matched (precise hits) but the larger parent chunk is returned to the LLM (richer context). Recommended for long documents (>10 pages); skip for short FAQs to save storage.',
|
||||||
parentChunkSizeLabel: 'Parent Chunk Size',
|
parentChunkSizeLabel: 'Parent Chunk Size',
|
||||||
parentChunkSizeDescription: 'Size of parent chunks that provide context (256-4096)',
|
parentChunkSizeDescription: 'Size of the context chunk returned to the LLM (512–8192). Default 4096 ≈ 1000 English tokens, fits comfortably in any modern LLM context window.',
|
||||||
childChunkSizeLabel: 'Child Chunk Size',
|
childChunkSizeLabel: 'Child Chunk Size',
|
||||||
childChunkSizeDescription: 'Size of child chunks used for embedding matching (64-1024)',
|
childChunkSizeDescription: 'Size of the embedded chunk used for vector match (64–2048). Default 384 ≈ 80 tokens — sweet spot for sentence-transformer / BGE-style embedders.',
|
||||||
strategyLabel: 'Chunking Strategy',
|
strategyLabel: 'Chunking Strategy',
|
||||||
strategyDescription: 'Choose how documents are split into chunks. The Automatic mode profiles each document and picks the best strategy.',
|
strategyDescription: 'Choose how documents are split into chunks. The Automatic mode profiles each document and picks the best strategy.',
|
||||||
strategyPlaceholder: 'Select strategy (defaults to classic recursive splitting)',
|
strategyPlaceholder: 'Select strategy (defaults to classic recursive splitting)',
|
||||||
@@ -1925,9 +1925,9 @@ export default {
|
|||||||
overlapWarning: 'Overlap is large compared to chunk size — chunks will share most of their content.',
|
overlapWarning: 'Overlap is large compared to chunk size — chunks will share most of their content.',
|
||||||
advancedLabel: 'Advanced options',
|
advancedLabel: 'Advanced options',
|
||||||
tokenLimitLabel: 'Token limit per chunk',
|
tokenLimitLabel: 'Token limit per chunk',
|
||||||
tokenLimitDescription: 'Cap chunk size in approximate tokens. When set, chunk size is automatically reduced to stay below this token budget. 0 = off (use character size only).',
|
tokenLimitDescription: 'Hard token cap per chunk (0–8192). 0 = off (chunk size in characters only). Activate when your embedding model has a small token limit: 200 for MiniLM (256 tok), 400 for BGE/Cohere (512 tok). Modern embedders (OpenAI, Voyage, Jina-v3) accept >2000 tokens — leave at 0.',
|
||||||
languagesLabel: 'Language hints',
|
languagesLabel: 'Language hints',
|
||||||
languagesDescription: 'Hint which languages structure-detection patterns should look for. Leave empty for auto-detection.',
|
languagesDescription: 'Restricts heuristic patterns to the chosen languages (DE/EN/ZH). Empty = auto-detect from sample. Set explicitly for homogeneous corpora to avoid false-positive matches across languages.',
|
||||||
languagesPlaceholder: 'Auto-detect',
|
languagesPlaceholder: 'Auto-detect',
|
||||||
languageOptions: {
|
languageOptions: {
|
||||||
de: 'German',
|
de: 'German',
|
||||||
|
|||||||
@@ -2453,14 +2453,14 @@ export default {
|
|||||||
},
|
},
|
||||||
chunking: {
|
chunking: {
|
||||||
title: "청크 설정",
|
title: "청크 설정",
|
||||||
description: "문서 청킹 파라미터를 설정하여 검색 효과 최적화",
|
description: "업로드된 문서가 임베딩되기 전에 분할되는 방식을 제어합니다. 대부분의 경우 기본값으로 충분합니다.",
|
||||||
sizeLabel: "청크 크기",
|
sizeLabel: "청크 크기",
|
||||||
sizeDescription: "각 문서 청크의 문자 수 제어 (100-4000)",
|
sizeDescription: "청크당 최대 문자 수 (100-4000). 기본값 512 ≈ 영어 100-130 토큰. FAQ는 200-400, 서술형 문서는 1000-2000.",
|
||||||
characters: "문자",
|
characters: "문자",
|
||||||
overlapLabel: "청크 중복",
|
overlapLabel: "청크 중복",
|
||||||
overlapDescription: "인접 문서 청크 간의 중복 문자 수 (0-500)",
|
overlapDescription: "인접 청크 간 공유 문자 수 (0-500). 기본값 80 ≈ 청크 크기의 15% — 현재 연구 권장값. FAQ/구조화 데이터는 0, 긴 서술은 150-200.",
|
||||||
separatorsLabel: "구분자",
|
separatorsLabel: "구분자",
|
||||||
separatorsDescription: "문서 청킹 시 사용되는 구분자",
|
separatorsDescription: "분할 시 우선적으로 사용되는 문자/문자열. 우선순위가 높은 구분자를 먼저 시도; 기본 순서는 단락 → 문장 → 구두점.",
|
||||||
separatorsPlaceholder: "구분자 선택 또는 사용자 정의",
|
separatorsPlaceholder: "구분자 선택 또는 사용자 정의",
|
||||||
separators: {
|
separators: {
|
||||||
doubleNewline: "이중 줄바꿈 (\\n\\n)",
|
doubleNewline: "이중 줄바꿈 (\\n\\n)",
|
||||||
@@ -2473,11 +2473,11 @@ export default {
|
|||||||
space: "공백 ( )",
|
space: "공백 ( )",
|
||||||
},
|
},
|
||||||
parentChildLabel: "부모-자식 청킹",
|
parentChildLabel: "부모-자식 청킹",
|
||||||
parentChildDescription: "2단계 부모-자식 청킹 전략을 활성화합니다. 큰 부모 청크는 컨텍스트를 제공하고, 작은 자식 청크는 벡터 매칭에 사용됩니다.",
|
parentChildDescription: "2단계 청킹: 작은 자식 청크는 벡터 매칭(정확한 히트), 큰 부모 청크는 LLM에 반환(풍부한 컨텍스트). 긴 문서(>10페이지)에 권장; 짧은 FAQ는 비활성화하여 저장 공간 절약.",
|
||||||
parentChunkSizeLabel: "부모 청크 크기",
|
parentChunkSizeLabel: "부모 청크 크기",
|
||||||
parentChunkSizeDescription: "컨텍스트를 제공하는 부모 청크의 문자 수 (256-4096)",
|
parentChunkSizeDescription: "LLM에 반환되는 컨텍스트 청크 크기 (512-8192). 기본값 4096 ≈ 1000 영어 토큰, 모든 현대 LLM 컨텍스트에 적합.",
|
||||||
childChunkSizeLabel: "자식 청크 크기",
|
childChunkSizeLabel: "자식 청크 크기",
|
||||||
childChunkSizeDescription: "임베딩 매칭에 사용되는 자식 청크의 문자 수 (64-1024)",
|
childChunkSizeDescription: "벡터 매칭에 사용되는 임베딩 청크 크기 (64-2048). 기본값 384 ≈ 80 토큰 — sentence-transformer / BGE 임베더의 최적점.",
|
||||||
strategyLabel: "청킹 전략",
|
strategyLabel: "청킹 전략",
|
||||||
strategyDescription: "문서를 청크로 분할하는 방법을 선택합니다. 자동 모드는 문서를 프로파일링하여 최적의 전략을 선택합니다.",
|
strategyDescription: "문서를 청크로 분할하는 방법을 선택합니다. 자동 모드는 문서를 프로파일링하여 최적의 전략을 선택합니다.",
|
||||||
strategyPlaceholder: "전략 선택 (기본: 클래식 재귀 분할)",
|
strategyPlaceholder: "전략 선택 (기본: 클래식 재귀 분할)",
|
||||||
@@ -2502,9 +2502,9 @@ export default {
|
|||||||
overlapWarning: "오버랩이 청크 크기에 비해 큽니다 — 청크가 대부분의 콘텐츠를 공유합니다.",
|
overlapWarning: "오버랩이 청크 크기에 비해 큽니다 — 청크가 대부분의 콘텐츠를 공유합니다.",
|
||||||
advancedLabel: "고급 옵션",
|
advancedLabel: "고급 옵션",
|
||||||
tokenLimitLabel: "청크당 토큰 제한",
|
tokenLimitLabel: "청크당 토큰 제한",
|
||||||
tokenLimitDescription: "근사 토큰 수로 청크 크기를 제한합니다. 설정되면 토큰 예산을 유지하기 위해 청크 크기가 자동으로 축소됩니다. 0 = 끄기 (문자 크기만 사용).",
|
tokenLimitDescription: "청크당 토큰 하드 제한 (0-8192). 0 = 끄기 (문자 크기만). 임베딩 모델의 토큰 제한이 작을 때 활성화: MiniLM (256 tok)은 200, BGE/Cohere (512 tok)는 400. 현대 임베더(OpenAI, Voyage, Jina-v3)는 >2000 토큰 지원 — 0으로 두세요.",
|
||||||
languagesLabel: "언어 힌트",
|
languagesLabel: "언어 힌트",
|
||||||
languagesDescription: "구조 감지 패턴이 찾아야 할 언어를 힌트로 제공합니다. 자동 감지를 위해 비워두세요.",
|
languagesDescription: "휴리스틱 패턴을 선택한 언어(DE/EN/ZH)로만 제한합니다. 비어 있음 = 샘플에서 자동 감지. 동질적인 코퍼스는 명시적으로 설정하여 언어 간 오탐 방지.",
|
||||||
languagesPlaceholder: "자동 감지",
|
languagesPlaceholder: "자동 감지",
|
||||||
languageOptions: {
|
languageOptions: {
|
||||||
de: "독일어",
|
de: "독일어",
|
||||||
|
|||||||
@@ -2088,14 +2088,14 @@ export default {
|
|||||||
},
|
},
|
||||||
chunking: {
|
chunking: {
|
||||||
title: 'Настройки разбиения',
|
title: 'Настройки разбиения',
|
||||||
description: 'Настройте параметры разбиения документов для улучшения качества поиска',
|
description: 'Управляет тем, как загруженные документы разбиваются перед эмбеддингом. Значения по умолчанию подходят для большинства случаев — настройте только если качество поиска плохое.',
|
||||||
sizeLabel: 'Размер блока',
|
sizeLabel: 'Размер блока',
|
||||||
sizeDescription: 'Определяет количество символов в каждом блоке (100-4000)',
|
sizeDescription: 'Максимальное количество символов в блоке (100–4000). По умолчанию 512 ≈ 100–130 английских токенов. Меньше для FAQ (200–400), больше для повествовательных документов (1000–2000).',
|
||||||
characters: 'символов',
|
characters: 'символов',
|
||||||
overlapLabel: 'Перекрытие блоков',
|
overlapLabel: 'Перекрытие блоков',
|
||||||
overlapDescription: 'Количество перекрывающихся символов между соседними блоками (0-500)',
|
overlapDescription: 'Количество символов, общих для соседних блоков (0–500). По умолчанию 80 ≈ 15% размера — оптимум по текущим исследованиям. 0 для FAQ/структурированных данных, 150–200 для длинных повествований.',
|
||||||
separatorsLabel: 'Разделители',
|
separatorsLabel: 'Разделители',
|
||||||
separatorsDescription: 'Разделители, используемые при разбиении документов',
|
separatorsDescription: 'Символы или строки, которые сплиттер предпочитает при резке. Разделители более высокого приоритета пробуются первыми; порядок по умолчанию: абзацы → предложения → пунктуация.',
|
||||||
separatorsPlaceholder: 'Выберите или настройте разделители',
|
separatorsPlaceholder: 'Выберите или настройте разделители',
|
||||||
separators: {
|
separators: {
|
||||||
doubleNewline: 'Двойной перевод строки (\\n\\n)',
|
doubleNewline: 'Двойной перевод строки (\\n\\n)',
|
||||||
@@ -2108,11 +2108,11 @@ export default {
|
|||||||
space: 'Пробел ( )'
|
space: 'Пробел ( )'
|
||||||
},
|
},
|
||||||
parentChildLabel: 'Родительско-дочернее разбиение',
|
parentChildLabel: 'Родительско-дочернее разбиение',
|
||||||
parentChildDescription: 'Включить двухуровневую стратегию разбиения. Большие родительские блоки обеспечивают контекст, а маленькие дочерние блоки используются для векторного поиска.',
|
parentChildDescription: 'Двухуровневое разбиение: маленькие дочерние блоки используются для векторного матчинга (точные совпадения), большой родительский блок возвращается LLM (более богатый контекст). Рекомендуется для длинных документов (>10 страниц); пропустите для коротких FAQ для экономии хранилища.',
|
||||||
parentChunkSizeLabel: 'Размер родительского блока',
|
parentChunkSizeLabel: 'Размер родительского блока',
|
||||||
parentChunkSizeDescription: 'Размер родительских блоков для контекста (256-4096)',
|
parentChunkSizeDescription: 'Размер контекстного блока, возвращаемого LLM (512–8192). По умолчанию 4096 ≈ 1000 английских токенов, комфортно вписывается в любое современное контекстное окно.',
|
||||||
childChunkSizeLabel: 'Размер дочернего блока',
|
childChunkSizeLabel: 'Размер дочернего блока',
|
||||||
childChunkSizeDescription: 'Размер дочерних блоков для поиска по эмбеддингам (64-1024)',
|
childChunkSizeDescription: 'Размер встроенного блока для векторного матчинга (64–2048). По умолчанию 384 ≈ 80 токенов — оптимум для эмбеддеров уровня sentence-transformer / BGE.',
|
||||||
strategyLabel: 'Стратегия разбиения',
|
strategyLabel: 'Стратегия разбиения',
|
||||||
strategyDescription: 'Выберите способ разбиения документов на блоки. Автоматический режим анализирует каждый документ и выбирает оптимальную стратегию.',
|
strategyDescription: 'Выберите способ разбиения документов на блоки. Автоматический режим анализирует каждый документ и выбирает оптимальную стратегию.',
|
||||||
strategyPlaceholder: 'Выберите стратегию (по умолчанию классическое рекурсивное разбиение)',
|
strategyPlaceholder: 'Выберите стратегию (по умолчанию классическое рекурсивное разбиение)',
|
||||||
@@ -2137,9 +2137,9 @@ export default {
|
|||||||
overlapWarning: 'Перекрытие велико по сравнению с размером блока — блоки будут содержать большую часть одинакового контента.',
|
overlapWarning: 'Перекрытие велико по сравнению с размером блока — блоки будут содержать большую часть одинакового контента.',
|
||||||
advancedLabel: 'Расширенные параметры',
|
advancedLabel: 'Расширенные параметры',
|
||||||
tokenLimitLabel: 'Лимит токенов на блок',
|
tokenLimitLabel: 'Лимит токенов на блок',
|
||||||
tokenLimitDescription: 'Ограничьте размер блока приблизительным количеством токенов. При установке размер блока автоматически уменьшается, чтобы оставаться в пределах лимита. 0 = выключено (только символьный размер).',
|
tokenLimitDescription: 'Жёсткий лимит токенов на блок (0–8192). 0 = выкл (только символы). Активируйте, когда у вашего эмбеддера небольшой токен-лимит: 200 для MiniLM (256 ток), 400 для BGE/Cohere (512 ток). Современные эмбеддеры (OpenAI, Voyage, Jina-v3) поддерживают >2000 токенов — оставьте 0.',
|
||||||
languagesLabel: 'Языковые подсказки',
|
languagesLabel: 'Языковые подсказки',
|
||||||
languagesDescription: 'Подсказка о том, какие языки должны искать паттерны определения структуры. Оставьте пустым для автоопределения.',
|
languagesDescription: 'Ограничивает эвристические паттерны выбранными языками (DE/EN/ZH). Пусто = автоопределение из образца. Установите явно для однородных корпусов, чтобы избежать ложных срабатываний между языками.',
|
||||||
languagesPlaceholder: 'Автоопределение',
|
languagesPlaceholder: 'Автоопределение',
|
||||||
languageOptions: {
|
languageOptions: {
|
||||||
de: 'Немецкий',
|
de: 'Немецкий',
|
||||||
|
|||||||
@@ -2412,14 +2412,14 @@ export default {
|
|||||||
},
|
},
|
||||||
chunking: {
|
chunking: {
|
||||||
title: "分块设置",
|
title: "分块设置",
|
||||||
description: "配置文档分块参数,优化检索效果",
|
description: "控制上传文档在嵌入前的切分方式。默认值适用于大多数场景,仅在检索质量异常时调整。",
|
||||||
sizeLabel: "分块大小",
|
sizeLabel: "分块大小",
|
||||||
sizeDescription: "控制每个文档分块的字符数(100-4000)",
|
sizeDescription: "每个分块的最大字符数(100-4000)。默认 512 ≈ 中文 300 tokens / 英文 100-130 tokens。FAQ 用 200-400,叙述性长文档用 1000-2000。",
|
||||||
characters: "字符",
|
characters: "字符",
|
||||||
overlapLabel: "分块重叠",
|
overlapLabel: "分块重叠",
|
||||||
overlapDescription: "相邻文档块之间的重叠字符数(0-500)",
|
overlapDescription: "相邻分块之间共享的字符数(0-500)。默认 80 ≈ 分块大小的 15%,符合当前研究推荐。FAQ/结构化数据用 0,长篇叙述用 150-200。",
|
||||||
separatorsLabel: "分隔符",
|
separatorsLabel: "分隔符",
|
||||||
separatorsDescription: "文档分块时使用的分隔符",
|
separatorsDescription: "切分时优先使用的字符或字符串。优先级高的分隔符先尝试;默认顺序优先段落 → 句子 → 标点。",
|
||||||
separatorsPlaceholder: "选择或自定义分隔符",
|
separatorsPlaceholder: "选择或自定义分隔符",
|
||||||
separators: {
|
separators: {
|
||||||
doubleNewline: "双换行 (\\n\\n)",
|
doubleNewline: "双换行 (\\n\\n)",
|
||||||
@@ -2432,11 +2432,11 @@ export default {
|
|||||||
space: "空格 ( )",
|
space: "空格 ( )",
|
||||||
},
|
},
|
||||||
parentChildLabel: "父子分块",
|
parentChildLabel: "父子分块",
|
||||||
parentChildDescription: "启用两级父子分块策略。大的父块提供上下文,小的子块用于向量匹配检索。",
|
parentChildDescription: "两级分块:小的子块用于向量匹配(精准命中),大的父块返回给 LLM(更丰富上下文)。建议用于长文档(>10 页);短 FAQ 可关闭以节省存储。",
|
||||||
parentChunkSizeLabel: "父块大小",
|
parentChunkSizeLabel: "父块大小",
|
||||||
parentChunkSizeDescription: "提供上下文的父块字符数(256-4096)",
|
parentChunkSizeDescription: "返回给 LLM 的上下文块大小(512-8192)。默认 4096 ≈ 1000 英文 tokens,适合所有现代 LLM 上下文窗口。",
|
||||||
childChunkSizeLabel: "子块大小",
|
childChunkSizeLabel: "子块大小",
|
||||||
childChunkSizeDescription: "用于向量匹配的子块字符数(64-1024)",
|
childChunkSizeDescription: "用于向量匹配的嵌入块大小(64-2048)。默认 384 ≈ 80 tokens,是 sentence-transformer / BGE 类嵌入模型的最佳点。",
|
||||||
strategyLabel: "分块策略",
|
strategyLabel: "分块策略",
|
||||||
strategyDescription: "选择文档的分块方式。自动模式会分析每个文档的结构并选择最佳策略。",
|
strategyDescription: "选择文档的分块方式。自动模式会分析每个文档的结构并选择最佳策略。",
|
||||||
strategyPlaceholder: "选择策略(默认使用经典递归分块)",
|
strategyPlaceholder: "选择策略(默认使用经典递归分块)",
|
||||||
@@ -2461,9 +2461,9 @@ export default {
|
|||||||
overlapWarning: "重叠相对于分块大小较大——分块之间会共享大部分内容。",
|
overlapWarning: "重叠相对于分块大小较大——分块之间会共享大部分内容。",
|
||||||
advancedLabel: "高级选项",
|
advancedLabel: "高级选项",
|
||||||
tokenLimitLabel: "每块 Token 上限",
|
tokenLimitLabel: "每块 Token 上限",
|
||||||
tokenLimitDescription: "按近似 Token 数限制分块大小。设置后会自动缩小分块以保持在 Token 预算内。0 = 关闭(仅按字符数)。",
|
tokenLimitDescription: "每个分块的硬性 Token 上限(0-8192)。0 = 关闭(仅按字符数)。当嵌入模型 Token 上限较小时启用:MiniLM (256 tok) 用 200,BGE/Cohere (512 tok) 用 400。现代嵌入器(OpenAI、Voyage、Jina-v3)支持 >2000 tokens,保持 0 即可。",
|
||||||
languagesLabel: "语言提示",
|
languagesLabel: "语言提示",
|
||||||
languagesDescription: "提示结构检测模式应识别哪些语言。留空则自动检测。",
|
languagesDescription: "限制启发式模式只识别选定的语言(DE/EN/ZH)。留空 = 自动检测。同质化语料库可显式设置以避免跨语言误匹配。",
|
||||||
languagesPlaceholder: "自动检测",
|
languagesPlaceholder: "自动检测",
|
||||||
languageOptions: {
|
languageOptions: {
|
||||||
de: "德语",
|
de: "德语",
|
||||||
|
|||||||
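The rewritten tokenLimitDescription strings above encode a small lookup: which token limit to set for which embedder family. A minimal sketch of that lookup follows. The `EmbedderFamily` type, the table name, and the `recommendedTokenLimit` helper are hypothetical and not part of the commit; only the numbers are taken from the strings.

```typescript
// Hypothetical helper: the family names and function are illustrative only.
// The numeric values mirror the tokenLimitDescription locale strings.
type EmbedderFamily = 'minilm' | 'bge' | 'cohere' | 'openai' | 'voyage' | 'jina-v3';

const RECOMMENDED_TOKEN_LIMIT: Record<EmbedderFamily, number> = {
  minilm: 200,   // model limit 256 tokens, leave headroom
  bge: 400,      // model limit 512 tokens
  cohere: 400,   // model limit 512 tokens
  openai: 0,     // 0 = off, character budget only
  voyage: 0,
  'jina-v3': 0,
};

function recommendedTokenLimit(family: EmbedderFamily): number {
  return RECOMMENDED_TOKEN_LIMIT[family];
}
```

A value of 0 keeps the purely character-based budget, which is what the strings recommend for embedders that accept more than 2000 tokens.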
@@ -409,10 +409,13 @@ const WIKI_ONLY_CHUNKING_PRESET = {
|
|||||||
enableParentChild: false,
|
enableParentChild: false,
|
||||||
} as const
|
} as const
|
||||||
|
|
||||||
// 非 Wiki-only 场景下回落到的默认值(与 initFormData 保持一致)。
|
// Non-Wiki-only fallback. Mirrors chunker.DefaultChunkSize and
|
||||||
|
// DefaultChunkOverlap on the backend so a freshly created KB uses
|
||||||
|
// the same numbers whether the editor sets them or the splitter
|
||||||
|
// falls back to its package defaults.
|
||||||
const DEFAULT_CHUNKING_PRESET = {
|
const DEFAULT_CHUNKING_PRESET = {
|
||||||
chunkSize: 512,
|
chunkSize: 512,
|
||||||
chunkOverlap: 100,
|
chunkOverlap: 80,
|
||||||
enableParentChild: true,
|
enableParentChild: true,
|
||||||
} as const
|
} as const
|
||||||
|
|
||||||
@@ -485,7 +488,9 @@ const initFormData = (type: 'document' | 'faq' = 'document') => {
|
|||||||
},
|
},
|
||||||
chunkingConfig: {
|
chunkingConfig: {
|
||||||
chunkSize: 512,
|
chunkSize: 512,
|
||||||
chunkOverlap: 100,
|
// 80 ≈ 15% of chunkSize — community-recommended sweet spot.
|
||||||
|
// Aligned with chunker.DefaultChunkOverlap on the backend.
|
||||||
|
chunkOverlap: 80,
|
||||||
separators: ['\n\n', '\n', '。', '!', '?', ';', ';'],
|
separators: ['\n\n', '\n', '。', '!', '?', ';', ';'],
|
||||||
parserEngineRules: undefined as any,
|
parserEngineRules: undefined as any,
|
||||||
enableParentChild: true,
|
enableParentChild: true,
|
||||||
@@ -586,7 +591,9 @@ const loadKBData = async () => {
|
|||||||
},
|
},
|
||||||
chunkingConfig: {
|
chunkingConfig: {
|
||||||
chunkSize: kb.chunking_config?.chunk_size || 512,
|
chunkSize: kb.chunking_config?.chunk_size || 512,
|
||||||
chunkOverlap: kb.chunking_config?.chunk_overlap || 100,
|
// Fallback only used when the loaded KB has no chunk_overlap stored.
|
||||||
|
// Aligned with chunker.DefaultChunkOverlap on the backend.
|
||||||
|
chunkOverlap: kb.chunking_config?.chunk_overlap || 80,
|
||||||
separators: kb.chunking_config?.separators || ['\n\n', '\n', '。', '!', '?', ';', ';'],
|
separators: kb.chunking_config?.separators || ['\n\n', '\n', '。', '!', '?', ';', ';'],
|
||||||
parserEngineRules: kb.chunking_config?.parser_engine_rules || undefined,
|
parserEngineRules: kb.chunking_config?.parser_engine_rules || undefined,
|
||||||
enableParentChild: kb.chunking_config?.enable_parent_child || false,
|
enableParentChild: kb.chunking_config?.enable_parent_child || false,
|
||||||
|
|||||||
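One detail of the loadKBData fallback above is worth illustrating: in JavaScript and TypeScript, `||` falls back on any falsy value, including a stored 0, while `??` falls back only on `null` or `undefined`. The sketch below contrasts the two. The `StoredChunkingConfig` interface and both function names are hypothetical. Whether a stored 0 should survive is an assumption that depends on the backend: the migration note in splitter.go suggests a stored ChunkOverlap=0 picks up the default on re-index, so `||` may well be intentional here.

```typescript
// Hypothetical shape of the stored config, for illustration only.
interface StoredChunkingConfig {
  chunk_overlap?: number | null;
}

const DEFAULT_CHUNK_OVERLAP = 80;

// Mirrors the `||` form used in the hunk: 0, null and undefined all fall back.
function overlapWithOr(cfg?: StoredChunkingConfig): number {
  return cfg?.chunk_overlap || DEFAULT_CHUNK_OVERLAP;
}

// Alternative: only null/undefined fall back, an explicit 0 is preserved.
function overlapWithNullish(cfg?: StoredChunkingConfig): number {
  return cfg?.chunk_overlap ?? DEFAULT_CHUNK_OVERLAP;
}
```

With `{ chunk_overlap: 0 }`, the first function yields 80 and the second yields 0; with no stored value, both yield 80.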
@@ -217,6 +217,19 @@ interface ParserEngineRule {
|
|||||||
engine: string
|
engine: string
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// Slider ranges defined in this file (min/max props on t-slider) mirror
|
||||||
|
// the validated bounds in the backend splitter:
|
||||||
|
// ChunkSize: 100–4000 (default 512). 100 = too fragmented to be
|
||||||
|
// useful; 4000 = approaches the 7500-char absoluteMaxSize
|
||||||
|
// that the splitter hard-caps to anyway.
|
||||||
|
// ChunkOverlap: 0–500 (default 80). Backend caps to ChunkSize/2
|
||||||
|
// when set higher than that.
|
||||||
|
// ParentChunkSize: 512–8192 (default 4096 ≈ 1000 EN tokens).
|
||||||
|
// ChildChunkSize: 64–2048 (default 384 ≈ 80 EN tokens, sweet spot for
|
||||||
|
// sentence-transformer / BGE embedders).
|
||||||
|
// TokenLimit: 0–8192 (default 0 = off, char-based budget only).
|
||||||
|
// Set to 200 for MiniLM (256-tok limit), 400 for BGE/
|
||||||
|
// Cohere (512-tok), leave at 0 for OpenAI/Voyage/Jina-v3.
|
||||||
interface ChunkingConfig {
|
interface ChunkingConfig {
|
||||||
chunkSize: number
|
chunkSize: number
|
||||||
chunkOverlap: number
|
chunkOverlap: number
|
||||||
@@ -225,11 +238,11 @@ interface ChunkingConfig {
|
|||||||
enableParentChild: boolean
|
enableParentChild: boolean
|
||||||
parentChunkSize: number
|
parentChunkSize: number
|
||||||
childChunkSize: number
|
childChunkSize: number
|
||||||
// New: adaptive chunking strategy. Empty string = legacy / not set.
|
// Adaptive chunking strategy. Empty string = legacy / not set.
|
||||||
strategy?: string
|
strategy?: string
|
||||||
// New: cap chunk size in approx tokens. 0 = char-based budget only.
|
// Cap chunk size in approx tokens. 0 = char-based budget only.
|
||||||
tokenLimit?: number
|
tokenLimit?: number
|
||||||
// New: language hints for heuristic patterns (de/en/zh).
|
// Language hints for heuristic patterns (de/en/zh).
|
||||||
languages?: string[]
|
languages?: string[]
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|||||||
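The comment block added above ChunkingConfig lists the slider bounds and notes that the backend caps overlap at ChunkSize/2. A minimal sketch of how those bounds could be applied on the client is shown below. The `RANGES` table, `clamp`, and `normalizeOverlap` are hypothetical names; the real validation lives in the backend splitter, and rounding the half-size down is an assumption about its integer arithmetic.

```typescript
// Bounds copied from the comment block; the names are illustrative only.
interface Range {
  min: number;
  max: number;
}

const RANGES = {
  chunkSize: { min: 100, max: 4000 },
  chunkOverlap: { min: 0, max: 500 },
  parentChunkSize: { min: 512, max: 8192 },
  childChunkSize: { min: 64, max: 2048 },
  tokenLimit: { min: 0, max: 8192 },
} as const;

function clamp(value: number, range: Range): number {
  return Math.min(range.max, Math.max(range.min, value));
}

// Clamp both values to their slider ranges, then cap overlap at half the
// chunk size (assumed to round down, as Go integer division would).
function normalizeOverlap(chunkSize: number, chunkOverlap: number): number {
  const size = clamp(chunkSize, RANGES.chunkSize);
  const overlap = clamp(chunkOverlap, RANGES.chunkOverlap);
  return Math.min(overlap, Math.floor(size / 2));
}
```

For the defaults (512 and 80) the overlap passes through unchanged; an overlap of 400 against a 512-character chunk would be reduced to 256.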
@@ -66,19 +66,33 @@ type SplitterConfig struct {
|
|||||||
Languages []string
|
Languages []string
|
||||||
}
|
}
|
||||||
|
|
||||||
// Default sizes used by all entry points (DefaultConfig, ensureDefaults,
|
// Default chunk sizing constants. Single source of truth for the entire
|
||||||
// and buildSplitterConfig in the knowledge service).
|
// chunker package and (via knowledge.go::buildSplitterConfig) the
|
||||||
|
// knowledge service. The frontend KnowledgeBaseEditorModal mirrors these
|
||||||
|
// numbers in its initial form state — keep them in sync if you change
|
||||||
|
// either value here.
|
||||||
|
//
|
||||||
|
// DefaultChunkSize = 512 chars: ~100–130 English tokens / ~300 Chinese
|
||||||
|
// tokens. Validated as a strong baseline by the Vecta Feb-2026 benchmark
|
||||||
|
// across 50 academic papers. Use 200–400 for FAQ-style atomic content,
|
||||||
|
// 1000–2000 for narrative / argumentative documents.
|
||||||
|
//
|
||||||
|
// DefaultChunkOverlap = 80 chars (≈15% of DefaultChunkSize): community-
|
||||||
|
// recommended sweet spot between recall (an answer split across a
|
||||||
|
// boundary needs overlap to be retrievable) and storage cost. Use 0 for
|
||||||
|
// strictly atomic data (FAQ, JSON records), 150–200 for long narratives
|
||||||
|
// where reasoning crosses chunks.
|
||||||
//
|
//
|
||||||
// MIGRATION NOTE: Prior versions had three different overlap defaults
|
// MIGRATION NOTE: Prior versions had three different overlap defaults
|
||||||
// (Go DefaultConfig: 64, knowledge.go buildSplitterConfig: 50, Python
|
// (Go DefaultConfig: 64, knowledge.go buildSplitterConfig: 50, Python
|
||||||
// docreader: 100). This file is now the single source of truth at 80
|
// docreader: 100). All consolidated to 80 here.
|
||||||
// (≈15% of DefaultChunkSize) — a community-recommended sweet spot.
|
|
||||||
//
|
//
|
||||||
// Existing knowledge bases that stored ChunkOverlap=0 in the DB will pick
|
// Existing knowledge bases that stored ChunkOverlap=0 in the DB pick
|
||||||
// up 80 on next re-index; their previously-indexed embeddings will not
|
// this 80 up on next re-index; their previously-indexed embeddings will
|
||||||
// match new ones bit-for-bit. Recall stays similar but search ranking
|
// not match new ones bit-for-bit. Recall stays similar but search
|
||||||
// can shift slightly. To freeze the old behavior on a per-KB basis,
|
// ranking can shift slightly. To freeze the old behavior on a per-KB
|
||||||
// explicitly set ChunkingConfig.ChunkOverlap to 64 before re-indexing.
|
// basis, explicitly set ChunkingConfig.ChunkOverlap to 64 before
|
||||||
|
// re-indexing.
|
||||||
const (
|
const (
|
||||||
DefaultChunkSize = 512
|
DefaultChunkSize = 512
|
||||||
DefaultChunkOverlap = 80
|
DefaultChunkOverlap = 80
|
||||||
|
|||||||
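The arithmetic behind the constants above can be checked with a short sketch. It is written in TypeScript rather than Go to stay consistent with the frontend hunks. The `approxTokens` helper and its linear scaling are assumptions for illustration, not the splitter's actual token estimator; the reference figures (512 characters ≈ 100 to 130 English tokens, ≈ 300 Chinese tokens) come from the comment block.

```typescript
const DEFAULT_CHUNK_SIZE = 512;
const DEFAULT_CHUNK_OVERLAP = 80;

// Overlap as a fraction of chunk size: 80 / 512 = 0.15625, i.e. about 15%.
const overlapRatio = DEFAULT_CHUNK_OVERLAP / DEFAULT_CHUNK_SIZE;

// Hypothetical estimator: scales linearly from "tokens per 512 characters".
// Pass 100 to 130 for English or about 300 for Chinese, per the comment.
function approxTokens(chars: number, tokensPerDefaultChunk: number): number {
  return Math.round((chars * tokensPerDefaultChunk) / DEFAULT_CHUNK_SIZE);
}
```

Under this scaling, a 4096-character parent chunk at 130 tokens per 512 characters comes out at 1040 tokens, in line with the "≈ 1000 EN tokens" figure quoted for ParentChunkSize.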