mirror of
https://github.com/Tencent/WeKnora.git
synced 2026-06-04 13:30:32 +08:00
refactor(agent): simplify grep_chunks tool to a single regex query
The grep_chunks tool previously accepted an array of regex queries (1-5) plus an optional knowledge_base_ids filter and limit. In practice the LLM either fired multiple near-duplicate calls or split synonyms across array entries instead of using POSIX alternation. KB scoping and the result limit are also server-side concerns the model should not control.

Reshape the contract to match `grep -E -i` semantics:

- Schema accepts a single required `query` string. Combine concepts with `|` alternation in one regex instead of making multiple calls.
- Drop `knowledge_base_ids` and `limit` from the schema; the tool now always searches the full agent scope and uses a fixed internal cap.
- Legacy `pattern`, `queries`, `patterns`, `max_results` keys are still accepted, and the legacy pattern inputs are joined into a single alternation regex so older callers and in-flight model outputs keep working.
- Update the agent system prompt template to document the new single `query` field.
- Frontend tool title now checks `queries`, `patterns`, `query`, and `pattern` (in that order) so the search text is shown again under the new schema.
- Add a dedicated `grepSearch` / `grepSearchFailed` tool status (zh-CN, en-US, ko-KR, ru-RU) and rename the zh-CN tool label to "搜索关键词" so the UI no longer prefixes the call with a generic "调用 ..." label.
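
The legacy-key handling described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the `legacyGrepInput` struct, `resolveQuery`, and `compileCI` are hypothetical names, and the precedence between legacy keys (`pattern`, then `queries`, then `patterns`) is an assumption. The only behaviors taken from the commit are that the canonical `query` is trimmed and used when present, that legacy inputs are joined with `|`, and that the regex is compiled with a `(?i)` prefix.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strings"
)

// legacyGrepInput is a hypothetical input struct for illustration only.
// The real GrepChunksInput may decode legacy keys differently.
type legacyGrepInput struct {
	Query    string   `json:"query,omitempty"`
	Pattern  string   `json:"pattern,omitempty"`
	Queries  []string `json:"queries,omitempty"`
	Patterns []string `json:"patterns,omitempty"`
}

// resolveQuery returns the canonical `query` when it is non-empty.
// Otherwise it joins all non-empty legacy inputs with "|", which keeps
// the old "match ANY" behavior in a single regex.
func resolveQuery(raw []byte) (string, error) {
	var in legacyGrepInput
	if err := json.Unmarshal(raw, &in); err != nil {
		return "", err
	}
	if q := strings.TrimSpace(in.Query); q != "" {
		return q, nil
	}
	// Assumed order: singular alias first, then the two array aliases.
	candidates := append([]string{in.Pattern}, in.Queries...)
	candidates = append(candidates, in.Patterns...)
	parts := make([]string, 0, len(candidates))
	for _, p := range candidates {
		if p = strings.TrimSpace(p); p != "" {
			parts = append(parts, p)
		}
	}
	if len(parts) == 0 {
		return "", fmt.Errorf("missing query parameter")
	}
	// "|" has the lowest precedence, so joined parts need no extra grouping.
	return strings.Join(parts, "|"), nil
}

// compileCI validates the regex and makes Go-side matching case-insensitive.
func compileCI(query string) (*regexp.Regexp, error) {
	return regexp.Compile("(?i)" + query)
}

func main() {
	q, err := resolveQuery([]byte(`{"queries": ["stardust", "skyvault"]}`))
	if err != nil {
		panic(err)
	}
	re, err := compileCI(q)
	if err != nil {
		panic(err)
	}
	fmt.Println(q, re.MatchString("The SkyVault engine"))
}
```

Joining with `|` means an older caller that sends `queries: ["a", "b"]` gets the same matches as a new caller that sends `query: "a|b"`, with one regex and one scan instead of several.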
@@ -128,7 +128,7 @@ templates:
|
||||
|
||||
### Core Retrieval Strategy (Strict Sequence)
|
||||
For every retrieval attempt (Phase 1 or Phase 3), follow this exact chain:
|
||||
1. **Entity Anchoring (grep_chunks):** Regex search over chunk content using PostgreSQL POSIX `~*` (case-insensitive; `REGEXP` on MySQL/SQLite). STRONGLY PREFER using regex to search for multiple concepts at once — pack 2-3 terms into one alternation query (e.g. `stardust|skyvault|psionic`) rather than firing several single-keyword calls. Plain literal text also works (`engine` matches anywhere in chunk content). Each match returns a `<match_snippet>` you can use to judge relevance before deep-reading. Input field is `queries` (array, 1-5).
|
||||
1. **Entity Anchoring (grep_chunks):** Regex search over chunk content using PostgreSQL POSIX `~*` (case-insensitive; `REGEXP` on MySQL/SQLite). Behaves like `grep -E -i`. Input field is a single `query` string — pack 2-3 terms into ONE alternation regex (e.g. `stardust|skyvault|psionic`) rather than firing several calls. Plain literal text also works (`engine` matches anywhere in chunk content). Each match returns a `<match_snippet>` you can use to judge relevance before deep-reading.
|
||||
2. **Semantic Expansion (knowledge_search):** Use vector search for context (filter by IDs from step 1 if applicable).
|
||||
3. **Deep Contextualization (list_knowledge_chunks): MANDATORY.**
|
||||
* Rule: After Step 1 or 2 returns knowledge_ids, you MUST call this tool.
|
||||
@@ -137,7 +137,7 @@ templates:
|
||||
5. **Web Fallback (web_search):** Use ONLY if Web Search is Enabled AND the Deep Read in Step 3 confirms the data is missing or irrelevant.
|
||||
|
||||
### Tool Selection Guidelines
|
||||
* **grep_chunks / knowledge_search:** Your "Index". Use these to find *where* the information might be. `grep_chunks` uses PostgreSQL POSIX `~*` regex (case-insensitive; `REGEXP` on MySQL/SQLite) — input field is `queries` (1–5 regex strings); STRONGLY PREFER one alternation query (`a|b|c`) over multiple single-keyword calls. Literal text like `engine` also works. `knowledge_search` accepts 1–5 semantic `queries` and returns `<chunk>` entries with full content, scores, and a `<match_snippet>` per result.
|
||||
* **grep_chunks / knowledge_search:** Your "Index". Use these to find *where* the information might be. `grep_chunks` uses PostgreSQL POSIX `~*` regex (case-insensitive; `REGEXP` on MySQL/SQLite) — input is a single `query` string; pack synonyms into ONE alternation regex (`a|b|c`) instead of multiple calls. Literal text like `engine` also works. `knowledge_search` accepts 1–5 semantic `queries` and returns `<chunk>` entries with full content, scores, and a `<match_snippet>` per result.
|
||||
* **list_knowledge_chunks:** Your "Eyes". MUST be used after every search. Use to read what the information is.
|
||||
* **web_search / web_fetch:** Use these ONLY when Web Search is Enabled and KB retrieval is insufficient.
|
||||
* **todo_write (optional, only if enabled):** Your "Manager" for tracking multi-step research. Only use it when the user has added it to the tool list.
|
||||
@@ -492,7 +492,7 @@ templates:
|
||||
* **wiki_search:** Wiki entry point. Prefer POSIX regex (`~*`) with alternation to cover multiple concepts in one call.
|
||||
* **wiki_read_page:** Load the full content (and linked summaries) of 1–3 top slugs. Batch multiple slugs in a single call when possible.
|
||||
* **knowledge_search:** Semantic retrieval over raw chunks. Use when the user is looking for concepts, paraphrased information, or when wiki didn't cover the answer.
|
||||
* **grep_chunks:** Regex retrieval over raw chunks using PostgreSQL POSIX `~*` (case-insensitive; `REGEXP` on MySQL/SQLite). Input field is `queries` (1–5 regex strings). STRONGLY PREFER using regex to search for multiple concepts at once — one alternation query `error_code_a|error_code_b` beats multiple single-keyword calls. Literal text also works. Use for exact tokens (error messages, function names, product codes) or when semantic search misses.
|
||||
* **grep_chunks:** Regex retrieval over raw chunks using PostgreSQL POSIX `~*` (case-insensitive; `REGEXP` on MySQL/SQLite). Input is a single `query` string. Pack multiple concepts into ONE alternation regex (`error_code_a|error_code_b`) — do not split into multiple calls. Literal text also works. Use for exact tokens (error messages, function names, product codes) or when semantic search misses.
|
||||
* **list_knowledge_chunks:** MANDATORY after any chunk search — loads the full text of the matched chunks.
|
||||
* **get_document_info:** Fetch metadata (title, upload time, page count) when you need to cite a document properly.
|
||||
* **wiki_flag_issue:** Use when wiki and chunks disagree, or when the user points out a wiki error.
|
||||
|
||||
@@ -3889,6 +3889,8 @@ export default {
|
||||
searchKbFailed: 'Knowledge base search failed',
|
||||
webSearch: 'Web search',
|
||||
webSearchFailed: 'Web search failed',
|
||||
grepSearch: 'Keyword search',
|
||||
grepSearchFailed: 'Keyword search failed',
|
||||
getDocInfo: 'Getting document info',
|
||||
getDocInfoFailed: 'Failed to get document info',
|
||||
thinkingDone: 'Thinking complete',
|
||||
|
||||
@@ -3952,6 +3952,8 @@ export default {
|
||||
searchKbFailed: '지식베이스 검색 실패',
|
||||
webSearch: '웹 검색',
|
||||
webSearchFailed: '웹 검색 실패',
|
||||
grepSearch: '키워드 검색',
|
||||
grepSearchFailed: '키워드 검색 실패',
|
||||
getDocInfo: '문서 정보 조회',
|
||||
getDocInfoFailed: '문서 정보 조회 실패',
|
||||
thinkingDone: '사고 완료',
|
||||
|
||||
@@ -3527,6 +3527,8 @@ export default {
|
||||
searchKbFailed: 'Ошибка поиска по базе знаний',
|
||||
webSearch: 'Веб-поиск',
|
||||
webSearchFailed: 'Ошибка веб-поиска',
|
||||
grepSearch: 'Поиск по ключевым словам',
|
||||
grepSearchFailed: 'Ошибка поиска по ключевым словам',
|
||||
getDocInfo: 'Получение информации о документе',
|
||||
getDocInfoFailed: 'Ошибка получения информации о документе',
|
||||
thinkingDone: 'Размышление завершено',
|
||||
|
||||
@@ -3723,7 +3723,7 @@ export default {
|
||||
tools: {
|
||||
multiKbSearch: "跨库搜索",
|
||||
knowledgeSearch: "知识库搜索",
|
||||
grepChunks: "文本模式搜索",
|
||||
grepChunks: "搜索关键词",
|
||||
getChunkDetail: "获取片段详情",
|
||||
listKnowledgeChunks: "查看知识分块",
|
||||
listKnowledgeBases: "列出知识库",
|
||||
@@ -3824,7 +3824,7 @@ export default {
|
||||
},
|
||||
tools: {
|
||||
searchKnowledge: "知识库检索",
|
||||
grepChunks: "文本模式搜索",
|
||||
grepChunks: "搜索关键词",
|
||||
webSearch: "网络搜索",
|
||||
webFetch: "网页抓取",
|
||||
getDocumentInfo: "获取文档信息",
|
||||
@@ -3884,6 +3884,8 @@ export default {
|
||||
searchKbFailed: "检索知识库失败",
|
||||
webSearch: "网络搜索",
|
||||
webSearchFailed: "网络搜索失败",
|
||||
grepSearch: "搜索关键词",
|
||||
grepSearchFailed: "搜索关键词失败",
|
||||
getDocInfo: "获取文档信息",
|
||||
getDocInfoFailed: "获取文档信息失败",
|
||||
thinkingDone: "完成思考",
|
||||
|
||||
@@ -2200,14 +2200,22 @@ const getToolTitle = (event: any): string => {
|
||||
// Try to get patterns from arguments or tool_data
|
||||
let patterns: string[] = [];
|
||||
if (event.arguments && typeof event.arguments === 'object') {
|
||||
if (Array.isArray(event.arguments.patterns)) {
|
||||
if (Array.isArray(event.arguments.queries)) {
|
||||
patterns = event.arguments.queries;
|
||||
} else if (Array.isArray(event.arguments.patterns)) {
|
||||
patterns = event.arguments.patterns;
|
||||
} else if (event.arguments.query) {
|
||||
patterns = [event.arguments.query];
|
||||
} else if (event.arguments.pattern) {
|
||||
patterns = [event.arguments.pattern];
|
||||
}
|
||||
} else if (event.tool_data) {
|
||||
if (Array.isArray(event.tool_data.patterns)) {
|
||||
if (Array.isArray(event.tool_data.queries)) {
|
||||
patterns = event.tool_data.queries;
|
||||
} else if (Array.isArray(event.tool_data.patterns)) {
|
||||
patterns = event.tool_data.patterns;
|
||||
} else if (event.tool_data.query) {
|
||||
patterns = [event.tool_data.query];
|
||||
} else if (event.tool_data.pattern) {
|
||||
patterns = [event.tool_data.pattern];
|
||||
}
|
||||
@@ -2244,6 +2252,8 @@ const getToolDescription = (event: any): string => {
|
||||
return success ? t('agentStream.toolStatus.searchKb') : t('agentStream.toolStatus.searchKbFailed');
|
||||
} else if (toolName === 'web_search') {
|
||||
return success ? t('agentStream.toolStatus.webSearch') : t('agentStream.toolStatus.webSearchFailed');
|
||||
} else if (toolName === 'grep_chunks') {
|
||||
return success ? t('agentStream.toolStatus.grepSearch') : t('agentStream.toolStatus.grepSearchFailed');
|
||||
} else if (toolName === 'get_document_info') {
|
||||
return success ? t('agentStream.toolStatus.getDocInfo') : t('agentStream.toolStatus.getDocInfoFailed');
|
||||
} else if (toolName === 'thinking') {
|
||||
|
||||
@@ -18,12 +18,12 @@ import (
|
||||
|
||||
var grepChunksTool = BaseTool{
|
||||
name: ToolGrepChunks,
|
||||
description: `Search knowledge base chunk content using PostgreSQL POSIX regular expressions (~* operator, case-insensitive; REGEXP on MySQL/SQLite).
|
||||
STRONGLY PREFER using regex to search for multiple concepts at once rather than simple plain text queries.
|
||||
Returns matching chunks with per-pattern hit counts and a <match_snippet> around the first match (each tagged with its knowledge_id and chunk_id).
|
||||
description: `Search knowledge base chunk content with a single POSIX regular expression, applied directly in the database (PostgreSQL ~* / MySQL/SQLite REGEXP, case-insensitive). Behaves like ` + "`grep -E -i`" + `.
|
||||
Pack multiple concepts into ONE regex using ` + "`|`" + ` alternation — do not call this tool repeatedly for synonyms.
|
||||
Returns matching chunks with hit counts and a <match_snippet> around the first match (each tagged with its knowledge_id and chunk_id).
|
||||
Examples:
|
||||
- Alternation (RECOMMENDED): "stardust|skyvault" (matches either word)
|
||||
- Multiple terms (RECOMMENDED): "psionic.*engine" (matches both words in order)
|
||||
- Alternation (RECOMMENDED): "stardust|skyvault|psionic" (matches any of the words)
|
||||
- Multiple terms in order: "psionic.*engine" (matches both words in order)
|
||||
- Word boundary / anchor: "\\brag\\b" or "^chapter\\s+\\d+"
|
||||
- Plain text: "engine" (matches literal substring anywhere in chunk content)
|
||||
IMPORTANT — JSON escaping: every backslash in a regex MUST be written as \\ inside the JSON tool arguments (e.g. to search for literal "C++" write "C\\+\\+", NOT "C\+\+"; for "\d+" write "\\d+"). Plain "\+" / "\d" etc. are invalid JSON escapes and will fail to parse.
|
||||
@@ -31,40 +31,24 @@ Use this to locate candidate chunks by exact identifiers, error codes, product n
|
||||
schema: json.RawMessage(`{
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"queries": {
|
||||
"type": "array",
|
||||
"items": { "type": "string" },
|
||||
"description": "List of regex queries to run. A chunk matches when ANY query matches its content. Prefer one alternation query (\"a|b|c\") over multiple single-keyword queries.",
|
||||
"minItems": 1,
|
||||
"maxItems": 5
|
||||
},
|
||||
"knowledge_base_ids": {
|
||||
"type": "array",
|
||||
"items": { "type": "string" },
|
||||
"description": "Optional: restrict search to specific KB IDs within the agent scope."
|
||||
},
|
||||
"limit": {
|
||||
"type": "integer",
|
||||
"description": "Max matching chunks to return (default 30, max 100).",
|
||||
"default": 30,
|
||||
"minimum": 1,
|
||||
"maximum": 100
|
||||
"query": {
|
||||
"type": "string",
|
||||
"description": "A single POSIX regex applied directly to chunk content (case-insensitive). Combine multiple concepts with \"|\" alternation in ONE regex (e.g. \"stardust|skyvault|psionic\") — do not split into multiple calls.",
|
||||
"minLength": 1
|
||||
}
|
||||
},
|
||||
"required": ["queries"]
|
||||
"required": ["query"]
|
||||
}`),
|
||||
}
|
||||
|
||||
// GrepChunksInput defines the input parameters for grep chunks tool.
|
||||
// The canonical parameter names are `queries` and `limit` (mirroring
|
||||
// wiki_search). The legacy `patterns` and `max_results` keys remain accepted
|
||||
// so older model outputs or external callers don't break silently.
|
||||
// The canonical parameter is a single `query` string (a regex with optional
|
||||
// `|` alternation), matching real `grep -E` semantics. Legacy array forms
|
||||
// (`queries`, `patterns`) and the singular `pattern` alias remain accepted
|
||||
// so older model outputs or external callers don't break silently — they
|
||||
// are joined together into a single alternation regex before execution.
|
||||
type GrepChunksInput struct {
|
||||
Queries []string `json:"queries,omitempty"`
|
||||
Patterns []string `json:"patterns,omitempty"` // legacy alias for queries
|
||||
KnowledgeBaseIDs []string `json:"knowledge_base_ids,omitempty"`
|
||||
Limit int `json:"limit,omitempty"`
|
||||
MaxResults int `json:"max_results,omitempty"` // legacy alias for limit
|
||||
Query string `json:"query,omitempty"`
|
||||
}
|
||||
|
||||
// GrepChunksTool performs regex pattern matching across knowledge base chunks.
|
||||
@@ -80,8 +64,8 @@ type GrepChunksTool struct {
|
||||
db *gorm.DB
|
||||
searchTargets types.SearchTargets
|
||||
|
||||
mu sync.Mutex
|
||||
seenChunks map[string]bool
|
||||
mu sync.Mutex
|
||||
seenChunks map[string]bool
|
||||
}
|
||||
|
||||
// NewGrepChunksTool creates a new grep chunks tool
|
||||
@@ -107,58 +91,38 @@ func (t *GrepChunksTool) Execute(ctx context.Context, args json.RawMessage) (*ty
|
||||
}, err
|
||||
}
|
||||
|
||||
// Accept both canonical (`queries`) and legacy (`patterns`) field names.
|
||||
// When both are present we concatenate, preserving whichever came first,
|
||||
// so a caller migrating between the two won't end up with nothing.
|
||||
rawQueries := append([]string{}, input.Queries...)
|
||||
rawQueries = append(rawQueries, input.Patterns...)
|
||||
// Resolve the canonical single-string `query`, falling back to legacy
|
||||
// aliases. Legacy array inputs are joined with `|` so they degrade into
|
||||
// a single alternation regex — preserving the previous "match ANY"
|
||||
// semantics without requiring multiple DB scans.
|
||||
query := strings.TrimSpace(input.Query)
|
||||
|
||||
queries := make([]string, 0, len(rawQueries))
|
||||
for _, q := range rawQueries {
|
||||
if strings.TrimSpace(q) != "" {
|
||||
queries = append(queries, q)
|
||||
}
|
||||
}
|
||||
|
||||
if len(queries) == 0 {
|
||||
logger.Errorf(ctx, "[Tool][GrepChunks] Missing or empty queries parameter")
|
||||
if query == "" {
|
||||
logger.Errorf(ctx, "[Tool][GrepChunks] Missing or empty query parameter")
|
||||
return &types.ToolResult{
|
||||
Success: false,
|
||||
Error: "queries parameter is required and must contain at least one non-empty regex query",
|
||||
}, fmt.Errorf("missing queries parameter")
|
||||
}
|
||||
if len(queries) > 5 {
|
||||
queries = queries[:5]
|
||||
Error: "query parameter is required and must be a non-empty regex string",
|
||||
}, fmt.Errorf("missing query parameter")
|
||||
}
|
||||
|
||||
// Compile queries with (?i) prefix for case-insensitive Go-side matching.
|
||||
// Compile with (?i) prefix for case-insensitive Go-side matching.
|
||||
// Compilation also validates the regex syntax before we send it to the DB.
|
||||
compiled := make([]*regexp.Regexp, 0, len(queries))
|
||||
for _, q := range queries {
|
||||
re, err := regexp.Compile("(?i)" + q)
|
||||
if err != nil {
|
||||
logger.Errorf(ctx, "[Tool][GrepChunks] Invalid regex %q: %v", q, err)
|
||||
return &types.ToolResult{
|
||||
Success: false,
|
||||
Error: fmt.Sprintf("invalid regex query %q: %v", q, err),
|
||||
}, err
|
||||
}
|
||||
compiled = append(compiled, re)
|
||||
re, err := regexp.Compile("(?i)" + query)
|
||||
if err != nil {
|
||||
logger.Errorf(ctx, "[Tool][GrepChunks] Invalid regex %q: %v", query, err)
|
||||
return &types.ToolResult{
|
||||
Success: false,
|
||||
Error: fmt.Sprintf("invalid regex query %q: %v", query, err),
|
||||
}, err
|
||||
}
|
||||
queries := []string{query}
|
||||
compiled := []*regexp.Regexp{re}
|
||||
|
||||
// Canonical `limit`, with `max_results` accepted as legacy alias.
|
||||
limit := input.Limit
|
||||
if limit <= 0 && input.MaxResults > 0 {
|
||||
limit = input.MaxResults
|
||||
}
|
||||
if limit <= 0 {
|
||||
limit = 30
|
||||
}
|
||||
if limit > 100 {
|
||||
limit = 100
|
||||
}
|
||||
// Result count is controlled by the backend, not the caller — keep it
|
||||
// bounded so the LLM context stays small regardless of regex breadth.
|
||||
const limit = 30
|
||||
|
||||
allowedKBIDs := t.searchTargets.GetAllKnowledgeBaseIDs()
|
||||
kbIDs := t.searchTargets.GetAllKnowledgeBaseIDs()
|
||||
kbTenantMap := t.searchTargets.GetKBTenantMap()
|
||||
|
||||
var allowedKnowledgeIDs []string
|
||||
@@ -168,19 +132,6 @@ func (t *GrepChunksTool) Execute(ctx context.Context, args json.RawMessage) (*ty
|
||||
}
|
||||
}
|
||||
|
||||
kbIDs := input.KnowledgeBaseIDs
|
||||
if len(kbIDs) == 0 {
|
||||
kbIDs = allowedKBIDs
|
||||
} else {
|
||||
validKBs := make([]string, 0)
|
||||
for _, kbID := range kbIDs {
|
||||
if t.searchTargets.ContainsKB(kbID) {
|
||||
validKBs = append(validKBs, kbID)
|
||||
}
|
||||
}
|
||||
kbIDs = validKBs
|
||||
}
|
||||
|
||||
logger.Infof(ctx, "[Tool][GrepChunks] Queries: %v, Limit: %d, KBs: %v, KnowledgeIDs: %v",
|
||||
queries, limit, kbIDs, allowedKnowledgeIDs)
|
||||
|
||||
@@ -243,8 +194,9 @@ func (t *GrepChunksTool) Execute(ctx context.Context, args json.RawMessage) (*ty
|
||||
Success: true,
|
||||
Output: output,
|
||||
Data: map[string]interface{}{
|
||||
"queries": queries,
|
||||
"patterns": queries, // legacy alias; frontend currently reads `patterns`
|
||||
"query": query,
|
||||
"queries": queries, // legacy alias for older frontends
|
||||
"patterns": queries, // legacy alias for older frontends
|
||||
"knowledge_results": aggregatedResults,
|
||||
"result_count": len(aggregatedResults),
|
||||
"total_matches": len(finalResults),
|
||||
|
||||