refactor(agent): simplify grep_chunks tool to a single regex query

The grep_chunks tool previously accepted an array of regex queries (1-5)
and an optional knowledge_base_ids filter and limit. In practice the LLM
either fired multiple near-duplicate calls or split synonyms across
entries instead of using POSIX alternation, and KB scoping plus result
limit are server-side concerns the model should not control.

Reshape the contract to match `grep -E -i` semantics:

- Schema accepts a single required `query` string. Combine concepts with
  `|` alternation in one regex instead of multiple calls.
- Drop `knowledge_base_ids` and `limit` from the schema; the tool now
  always searches the full agent scope and uses a fixed internal cap.
- Legacy `pattern`, `queries`, `patterns`, `max_results` keys are still
  accepted and joined into a single alternation regex so older callers
  and in-flight model outputs keep working.
- Update the agent system prompt template to document the new single
  `query` field.
- Frontend tool title now reads `queries`/`patterns`/`query`/`pattern`
  in that order so the search text is shown again under the new schema.
- Add a dedicated `grepSearch` / `grepSearchFailed` tool status (zh-CN,
  en-US, ko-KR, ru-RU) and rename the zh-CN tool label to "搜索关键词"
  so the UI no longer prefixes the call with a generic "调用 ..." label.
wizardchen
2026-05-20 19:32:37 +08:00
committed by lyingbug
parent 31f560ecf1
commit f3c7281f47
7 changed files with 69 additions and 99 deletions
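
The legacy-key handling described in the commit message can be sketched roughly as follows. This is an illustration only, not the code from this commit: `legacyInput` and `resolveQuery` are hypothetical names, and the precedence (canonical `query` first, then `pattern`, `queries`, `patterns`) is an assumption. Because `|` has the lowest precedence in a regex, joining the legacy entries with `|` keeps the earlier "match ANY" behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// legacyInput is a hypothetical stand-in for the tool input; the field set is
// assumed from the keys named in the commit message.
type legacyInput struct {
	Query    string
	Pattern  string
	Queries  []string
	Patterns []string
}

// resolveQuery returns the canonical query when it is non-empty. Otherwise it
// joins every non-empty legacy value with "|" into one alternation regex.
func resolveQuery(in legacyInput) string {
	if q := strings.TrimSpace(in.Query); q != "" {
		return q
	}
	candidates := append([]string{in.Pattern}, in.Queries...)
	candidates = append(candidates, in.Patterns...)
	parts := make([]string, 0, len(candidates))
	for _, c := range candidates {
		if c = strings.TrimSpace(c); c != "" {
			parts = append(parts, c)
		}
	}
	return strings.Join(parts, "|")
}

func main() {
	// Blank legacy entries are dropped before joining.
	fmt.Println(resolveQuery(legacyInput{Queries: []string{"stardust", " ", "skyvault"}}))
	// prints "stardust|skyvault"
}
```

An empty result from such a helper would then fall through to the "query parameter is required" error path.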


@@ -128,7 +128,7 @@ templates:
### Core Retrieval Strategy (Strict Sequence)
For every retrieval attempt (Phase 1 or Phase 3), follow this exact chain:
-1. **Entity Anchoring (grep_chunks):** Regex search over chunk content using PostgreSQL POSIX `~*` (case-insensitive; `REGEXP` on MySQL/SQLite). STRONGLY PREFER using regex to search for multiple concepts at once — pack 2-3 terms into one alternation query (e.g. `stardust|skyvault|psionic`) rather than firing several single-keyword calls. Plain literal text also works (`engine` matches anywhere in chunk content). Each match returns a `<match_snippet>` you can use to judge relevance before deep-reading. Input field is `queries` (array, 1-5).
+1. **Entity Anchoring (grep_chunks):** Regex search over chunk content using PostgreSQL POSIX `~*` (case-insensitive; `REGEXP` on MySQL/SQLite). Behaves like `grep -E -i`. Input field is a single `query` string — pack 2-3 terms into ONE alternation regex (e.g. `stardust|skyvault|psionic`) rather than firing several calls. Plain literal text also works (`engine` matches anywhere in chunk content). Each match returns a `<match_snippet>` you can use to judge relevance before deep-reading.
2. **Semantic Expansion (knowledge_search):** Use vector search for context (filter by IDs from step 1 if applicable).
3. **Deep Contextualization (list_knowledge_chunks): MANDATORY.**
* Rule: After Step 1 or 2 returns knowledge_ids, you MUST call this tool.
@@ -137,7 +137,7 @@ templates:
5. **Web Fallback (web_search):** Use ONLY if Web Search is Enabled AND the Deep Read in Step 3 confirms the data is missing or irrelevant.
### Tool Selection Guidelines
-* **grep_chunks / knowledge_search:** Your "Index". Use these to find *where* the information might be. `grep_chunks` uses PostgreSQL POSIX `~*` regex (case-insensitive; `REGEXP` on MySQL/SQLite) — input field is `queries` (1-5 regex strings); STRONGLY PREFER one alternation query (`a|b|c`) over multiple single-keyword calls. Literal text like `engine` also works. `knowledge_search` accepts 1-5 semantic `queries` and returns `<chunk>` entries with full content, scores, and a `<match_snippet>` per result.
+* **grep_chunks / knowledge_search:** Your "Index". Use these to find *where* the information might be. `grep_chunks` uses PostgreSQL POSIX `~*` regex (case-insensitive; `REGEXP` on MySQL/SQLite) — input is a single `query` string; pack synonyms into ONE alternation regex (`a|b|c`) instead of multiple calls. Literal text like `engine` also works. `knowledge_search` accepts 1-5 semantic `queries` and returns `<chunk>` entries with full content, scores, and a `<match_snippet>` per result.
* **list_knowledge_chunks:** Your "Eyes". MUST be used after every search. Use to read what the information is.
* **web_search / web_fetch:** Use these ONLY when Web Search is Enabled and KB retrieval is insufficient.
* **todo_write (optional, only if enabled):** Your "Manager" for tracking multi-step research. Only use it when the user has added it to the tool list.
@@ -492,7 +492,7 @@ templates:
* **wiki_search:** Wiki entry point. Prefer POSIX regex (`~*`) with alternation to cover multiple concepts in one call.
* **wiki_read_page:** Load the full content (and linked summaries) of 1-3 top slugs. Batch multiple slugs in a single call when possible.
* **knowledge_search:** Semantic retrieval over raw chunks. Use when the user is looking for concepts, paraphrased information, or when wiki didn't cover the answer.
-* **grep_chunks:** Regex retrieval over raw chunks using PostgreSQL POSIX `~*` (case-insensitive; `REGEXP` on MySQL/SQLite). Input field is `queries` (1-5 regex strings). STRONGLY PREFER using regex to search for multiple concepts at once — one alternation query `error_code_a|error_code_b` beats multiple single-keyword calls. Literal text also works. Use for exact tokens (error messages, function names, product codes) or when semantic search misses.
+* **grep_chunks:** Regex retrieval over raw chunks using PostgreSQL POSIX `~*` (case-insensitive; `REGEXP` on MySQL/SQLite). Input is a single `query` string. Pack multiple concepts into ONE alternation regex (`error_code_a|error_code_b`) — do not split into multiple calls. Literal text also works. Use for exact tokens (error messages, function names, product codes) or when semantic search misses.
* **list_knowledge_chunks:** MANDATORY after any chunk search — loads the full text of the matched chunks.
* **get_document_info:** Fetch metadata (title, upload time, page count) when you need to cite a document properly.
* **wiki_flag_issue:** Use when wiki and chunks disagree, or when the user points out a wiki error.
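
The `grep -E -i` behavior that the prompt now documents can be approximated in Go as a rough sketch. The helper name `grepChunks` and the sample chunks are illustrative assumptions; the real tool runs the regex in the database, which is not reproduced here.

```go
package main

import (
	"fmt"
	"regexp"
)

// grepChunks returns the chunks matched by a case-insensitive regex, roughly
// what `grep -E -i` would select for the same pattern.
func grepChunks(query string, chunks []string) ([]string, error) {
	re, err := regexp.Compile("(?i)" + query)
	if err != nil {
		return nil, err
	}
	var hits []string
	for _, c := range chunks {
		if re.MatchString(c) {
			hits = append(hits, c)
		}
	}
	return hits, nil
}

func main() {
	chunks := []string{
		"The Stardust drive powers the fleet.",
		"SkyVault stores orbital telemetry.",
		"Unrelated maintenance notes.",
	}
	// One alternation regex covers all three concepts in a single pass.
	hits, err := grepChunks("stardust|skyvault|psionic", chunks)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(hits)) // prints 2
}
```

Go's `regexp` package and the database engines do not share every escape. PostgreSQL, for example, documents `\y` as the word-boundary escape and treats `\b` as a backspace character, so a pattern that compiles in Go may still behave differently once it reaches SQL.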


@@ -3889,6 +3889,8 @@ export default {
searchKbFailed: 'Knowledge base search failed',
webSearch: 'Web search',
webSearchFailed: 'Web search failed',
+grepSearch: 'Keyword search',
+grepSearchFailed: 'Keyword search failed',
getDocInfo: 'Getting document info',
getDocInfoFailed: 'Failed to get document info',
thinkingDone: 'Thinking complete',


@@ -3952,6 +3952,8 @@ export default {
searchKbFailed: '지식베이스 검색 실패',
webSearch: '웹 검색',
webSearchFailed: '웹 검색 실패',
+grepSearch: '키워드 검색',
+grepSearchFailed: '키워드 검색 실패',
getDocInfo: '문서 정보 조회',
getDocInfoFailed: '문서 정보 조회 실패',
thinkingDone: '사고 완료',


@@ -3527,6 +3527,8 @@ export default {
searchKbFailed: 'Ошибка поиска по базе знаний',
webSearch: 'Веб-поиск',
webSearchFailed: 'Ошибка веб-поиска',
+grepSearch: 'Поиск по ключевым словам',
+grepSearchFailed: 'Ошибка поиска по ключевым словам',
getDocInfo: 'Получение информации о документе',
getDocInfoFailed: 'Ошибка получения информации о документе',
thinkingDone: 'Размышление завершено',


@@ -3723,7 +3723,7 @@ export default {
tools: {
multiKbSearch: "跨库搜索",
knowledgeSearch: "知识库搜索",
-grepChunks: "文本模式搜索",
+grepChunks: "搜索关键词",
getChunkDetail: "获取片段详情",
listKnowledgeChunks: "查看知识分块",
listKnowledgeBases: "列出知识库",
@@ -3824,7 +3824,7 @@ export default {
},
tools: {
searchKnowledge: "知识库检索",
-grepChunks: "文本模式搜索",
+grepChunks: "搜索关键词",
webSearch: "网络搜索",
webFetch: "网页抓取",
getDocumentInfo: "获取文档信息",
@@ -3884,6 +3884,8 @@ export default {
searchKbFailed: "检索知识库失败",
webSearch: "网络搜索",
webSearchFailed: "网络搜索失败",
+grepSearch: "搜索关键词",
+grepSearchFailed: "搜索关键词失败",
getDocInfo: "获取文档信息",
getDocInfoFailed: "获取文档信息失败",
thinkingDone: "完成思考",


@@ -2200,14 +2200,22 @@ const getToolTitle = (event: any): string => {
// Try to get patterns from arguments or tool_data
let patterns: string[] = [];
if (event.arguments && typeof event.arguments === 'object') {
-if (Array.isArray(event.arguments.patterns)) {
+if (Array.isArray(event.arguments.queries)) {
+patterns = event.arguments.queries;
+} else if (Array.isArray(event.arguments.patterns)) {
patterns = event.arguments.patterns;
+} else if (event.arguments.query) {
+patterns = [event.arguments.query];
} else if (event.arguments.pattern) {
patterns = [event.arguments.pattern];
}
} else if (event.tool_data) {
-if (Array.isArray(event.tool_data.patterns)) {
+if (Array.isArray(event.tool_data.queries)) {
+patterns = event.tool_data.queries;
+} else if (Array.isArray(event.tool_data.patterns)) {
patterns = event.tool_data.patterns;
+} else if (event.tool_data.query) {
+patterns = [event.tool_data.query];
} else if (event.tool_data.pattern) {
patterns = [event.tool_data.pattern];
}
@@ -2244,6 +2252,8 @@ const getToolDescription = (event: any): string => {
return success ? t('agentStream.toolStatus.searchKb') : t('agentStream.toolStatus.searchKbFailed');
} else if (toolName === 'web_search') {
return success ? t('agentStream.toolStatus.webSearch') : t('agentStream.toolStatus.webSearchFailed');
+} else if (toolName === 'grep_chunks') {
+return success ? t('agentStream.toolStatus.grepSearch') : t('agentStream.toolStatus.grepSearchFailed');
} else if (toolName === 'get_document_info') {
return success ? t('agentStream.toolStatus.getDocInfo') : t('agentStream.toolStatus.getDocInfoFailed');
} else if (toolName === 'thinking') {


@@ -18,12 +18,12 @@ import (
var grepChunksTool = BaseTool{
name: ToolGrepChunks,
-description: `Search knowledge base chunk content using PostgreSQL POSIX regular expressions (~* operator, case-insensitive; REGEXP on MySQL/SQLite).
-STRONGLY PREFER using regex to search for multiple concepts at once rather than simple plain text queries.
-Returns matching chunks with per-pattern hit counts and a <match_snippet> around the first match (each tagged with its knowledge_id and chunk_id).
+description: `Search knowledge base chunk content with a single POSIX regular expression, applied directly in the database (PostgreSQL ~* / MySQL/SQLite REGEXP, case-insensitive). Behaves like ` + "`grep -E -i`" + `.
+Pack multiple concepts into ONE regex using ` + "`|`" + ` alternation — do not call this tool repeatedly for synonyms.
+Returns matching chunks with hit counts and a <match_snippet> around the first match (each tagged with its knowledge_id and chunk_id).
Examples:
-- Alternation (RECOMMENDED): "stardust|skyvault" (matches either word)
-- Multiple terms (RECOMMENDED): "psionic.*engine" (matches both words in order)
+- Alternation (RECOMMENDED): "stardust|skyvault|psionic" (matches any of the words)
+- Multiple terms in order: "psionic.*engine" (matches both words in order)
- Word boundary / anchor: "\\brag\\b" or "^chapter\\s+\\d+"
- Plain text: "engine" (matches literal substring anywhere in chunk content)
IMPORTANT — JSON escaping: every backslash in a regex MUST be written as \\ inside the JSON tool arguments (e.g. to search for literal "C++" write "C\\+\\+", NOT "C\+\+"; for "\d+" write "\\d+"). Plain "\+" / "\d" etc. are invalid JSON escapes and will fail to parse.
@@ -31,40 +31,24 @@ Use this to locate candidate chunks by exact identifiers, error codes, product n
schema: json.RawMessage(`{
"type": "object",
"properties": {
-"queries": {
-"type": "array",
-"items": { "type": "string" },
-"description": "List of regex queries to run. A chunk matches when ANY query matches its content. Prefer one alternation query (\"a|b|c\") over multiple single-keyword queries.",
-"minItems": 1,
-"maxItems": 5
-},
-"knowledge_base_ids": {
-"type": "array",
-"items": { "type": "string" },
-"description": "Optional: restrict search to specific KB IDs within the agent scope."
-},
-"limit": {
-"type": "integer",
-"description": "Max matching chunks to return (default 30, max 100).",
-"default": 30,
-"minimum": 1,
-"maximum": 100
+"query": {
+"type": "string",
+"description": "A single POSIX regex applied directly to chunk content (case-insensitive). Combine multiple concepts with \"|\" alternation in ONE regex (e.g. \"stardust|skyvault|psionic\") — do not split into multiple calls.",
+"minLength": 1
}
},
-"required": ["queries"]
+"required": ["query"]
}`),
}
// GrepChunksInput defines the input parameters for grep chunks tool.
-// The canonical parameter names are `queries` and `limit` (mirroring
-// wiki_search). The legacy `patterns` and `max_results` keys remain accepted
-// so older model outputs or external callers don't break silently.
+// The canonical parameter is a single `query` string (a regex with optional
+// `|` alternation), matching real `grep -E` semantics. Legacy array forms
+// (`queries`, `patterns`) and the singular `pattern` alias remain accepted
+// so older model outputs or external callers don't break silently — they
+// are joined together into a single alternation regex before execution.
type GrepChunksInput struct {
-Queries []string `json:"queries,omitempty"`
-Patterns []string `json:"patterns,omitempty"` // legacy alias for queries
-KnowledgeBaseIDs []string `json:"knowledge_base_ids,omitempty"`
-Limit int `json:"limit,omitempty"`
-MaxResults int `json:"max_results,omitempty"` // legacy alias for limit
+Query string `json:"query,omitempty"`
}
// GrepChunksTool performs regex pattern matching across knowledge base chunks.
@@ -80,8 +64,8 @@ type GrepChunksTool struct {
db *gorm.DB
searchTargets types.SearchTargets
-mu sync.Mutex
-seenChunks map[string]bool
+mu sync.Mutex
+seenChunks map[string]bool
}
// NewGrepChunksTool creates a new grep chunks tool
@@ -107,58 +91,38 @@ func (t *GrepChunksTool) Execute(ctx context.Context, args json.RawMessage) (*ty
}, err
}
-// Accept both canonical (`queries`) and legacy (`patterns`) field names.
-// When both are present we concatenate, preserving whichever came first,
-// so a caller migrating between the two won't end up with nothing.
-rawQueries := append([]string{}, input.Queries...)
-rawQueries = append(rawQueries, input.Patterns...)
+// Resolve the canonical single-string `query`, falling back to legacy
+// aliases. Legacy array inputs are joined with `|` so they degrade into
+// a single alternation regex — preserving the previous "match ANY"
+// semantics without requiring multiple DB scans.
+query := strings.TrimSpace(input.Query)
-queries := make([]string, 0, len(rawQueries))
-for _, q := range rawQueries {
-if strings.TrimSpace(q) != "" {
-queries = append(queries, q)
-}
-}
-if len(queries) == 0 {
-logger.Errorf(ctx, "[Tool][GrepChunks] Missing or empty queries parameter")
+if query == "" {
+logger.Errorf(ctx, "[Tool][GrepChunks] Missing or empty query parameter")
return &types.ToolResult{
Success: false,
-Error: "queries parameter is required and must contain at least one non-empty regex query",
-}, fmt.Errorf("missing queries parameter")
-}
-if len(queries) > 5 {
-queries = queries[:5]
+Error: "query parameter is required and must be a non-empty regex string",
+}, fmt.Errorf("missing query parameter")
}
-// Compile queries with (?i) prefix for case-insensitive Go-side matching.
+// Compile with (?i) prefix for case-insensitive Go-side matching.
// Compilation also validates the regex syntax before we send it to the DB.
-compiled := make([]*regexp.Regexp, 0, len(queries))
-for _, q := range queries {
-re, err := regexp.Compile("(?i)" + q)
-if err != nil {
-logger.Errorf(ctx, "[Tool][GrepChunks] Invalid regex %q: %v", q, err)
-return &types.ToolResult{
-Success: false,
-Error: fmt.Sprintf("invalid regex query %q: %v", q, err),
-}, err
-}
-compiled = append(compiled, re)
+re, err := regexp.Compile("(?i)" + query)
+if err != nil {
+logger.Errorf(ctx, "[Tool][GrepChunks] Invalid regex %q: %v", query, err)
+return &types.ToolResult{
+Success: false,
+Error: fmt.Sprintf("invalid regex query %q: %v", query, err),
+}, err
}
+queries := []string{query}
+compiled := []*regexp.Regexp{re}
-// Canonical `limit`, with `max_results` accepted as legacy alias.
-limit := input.Limit
-if limit <= 0 && input.MaxResults > 0 {
-limit = input.MaxResults
-}
-if limit <= 0 {
-limit = 30
-}
-if limit > 100 {
-limit = 100
-}
+// Result count is controlled by the backend, not the caller — keep it
+// bounded so the LLM context stays small regardless of regex breadth.
+const limit = 30
-allowedKBIDs := t.searchTargets.GetAllKnowledgeBaseIDs()
+kbIDs := t.searchTargets.GetAllKnowledgeBaseIDs()
kbTenantMap := t.searchTargets.GetKBTenantMap()
var allowedKnowledgeIDs []string
@@ -168,19 +132,6 @@ func (t *GrepChunksTool) Execute(ctx context.Context, args json.RawMessage) (*ty
}
}
-kbIDs := input.KnowledgeBaseIDs
-if len(kbIDs) == 0 {
-kbIDs = allowedKBIDs
-} else {
-validKBs := make([]string, 0)
-for _, kbID := range kbIDs {
-if t.searchTargets.ContainsKB(kbID) {
-validKBs = append(validKBs, kbID)
-}
-}
-kbIDs = validKBs
-}
logger.Infof(ctx, "[Tool][GrepChunks] Queries: %v, Limit: %d, KBs: %v, KnowledgeIDs: %v",
queries, limit, kbIDs, allowedKnowledgeIDs)
@@ -243,8 +194,9 @@ func (t *GrepChunksTool) Execute(ctx context.Context, args json.RawMessage) (*ty
Success: true,
Output: output,
Data: map[string]interface{}{
-"queries": queries,
-"patterns": queries, // legacy alias; frontend currently reads `patterns`
+"query": query,
+"queries": queries, // legacy alias for older frontends
+"patterns": queries, // legacy alias for older frontends
"knowledge_results": aggregatedResults,
"result_count": len(aggregatedResults),
"total_matches": len(finalResults),
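
The JSON escaping rule stated in the tool description can be checked with a small program. This is a sketch under assumptions: `grepArgs` and `parseQuery` are hypothetical names that only mirror the `query` field and the `(?i)` compilation step shown in the diff, and nothing here touches the database.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// grepArgs is a hypothetical mirror of the single-field tool input.
type grepArgs struct {
	Query string `json:"query"`
}

// parseQuery decodes the JSON tool arguments and compiles the regex with the
// (?i) prefix, so both JSON errors and regex syntax errors surface early.
func parseQuery(raw string) (*regexp.Regexp, error) {
	var args grepArgs
	if err := json.Unmarshal([]byte(raw), &args); err != nil {
		return nil, fmt.Errorf("invalid JSON arguments: %w", err)
	}
	return regexp.Compile("(?i)" + args.Query)
}

func main() {
	// Correct: each regex backslash is doubled inside the JSON string.
	re, err := parseQuery(`{"query": "C\\+\\+"}`)
	fmt.Println(err == nil && re.MatchString("written in c++")) // prints true

	// Incorrect: "\+" is not a valid JSON escape, so decoding fails
	// before the regex is ever compiled.
	_, err = parseQuery(`{"query": "C\+\+"}`)
	fmt.Println(err != nil) // prints true
}
```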