refactor(agent): simplify grep_chunks tool to a single regex query

The grep_chunks tool previously accepted an array of regex queries (1-5) and an optional knowledge_base_ids filter and limit. In practice the LLM either fired multiple near-duplicate calls or split synonyms across entries instead of using POSIX alternation, and KB scoping plus result limit are server-side concerns the model should not control.

Reshape the contract to match `grep -E -i` semantics:

- Schema accepts a single required `query` string. Combine concepts with `|` alternation in one regex instead of multiple calls.
- Drop `knowledge_base_ids` and `limit` from the schema; the tool now always searches the full agent scope and uses a fixed internal cap.
- Legacy `pattern`, `queries`, `patterns`, `max_results` keys are still accepted and joined into a single alternation regex so older callers and in-flight model outputs keep working.
- Update the agent system prompt template to document the new single `query` field.
- Frontend tool title now reads `query`/`queries`/`pattern`/`patterns` in that order so the search text is shown again under the new schema.
- Add a dedicated `grepSearch` / `grepSearchFailed` tool status (zh-CN, en-US, ko-KR, ru-RU) and rename the zh-CN tool label to "搜索关键词" so the UI no longer prefixes the call with a generic "调用 ..." label.
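
The legacy-key fallback described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this commit: `legacyGrepInput` and `resolveQuery` are hypothetical names, the precedence (canonical `query` first, then `pattern`, `queries`, `patterns`) is an assumption, and `max_results` is simply left undecoded here because the result cap is now fixed server-side.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// legacyGrepInput is a hypothetical decoding target that lists the canonical
// `query` field next to the legacy aliases named in the commit message.
type legacyGrepInput struct {
	Query    string   `json:"query,omitempty"`
	Pattern  string   `json:"pattern,omitempty"`
	Queries  []string `json:"queries,omitempty"`
	Patterns []string `json:"patterns,omitempty"`
}

// resolveQuery returns the canonical query when it is non-empty. Otherwise it
// joins every non-blank legacy value with `|`, which keeps the old
// "match ANY" behaviour because alternation has the lowest regex precedence.
func resolveQuery(raw []byte) (string, error) {
	var in legacyGrepInput
	if err := json.Unmarshal(raw, &in); err != nil {
		return "", err
	}
	if q := strings.TrimSpace(in.Query); q != "" {
		return q, nil
	}

	// Assumed order: singular alias first, then the two array aliases.
	candidates := append([]string{in.Pattern}, in.Queries...)
	candidates = append(candidates, in.Patterns...)

	parts := make([]string, 0, len(candidates))
	for _, c := range candidates {
		if c = strings.TrimSpace(c); c != "" {
			parts = append(parts, c)
		}
	}
	if len(parts) == 0 {
		return "", fmt.Errorf("missing query parameter")
	}
	return strings.Join(parts, "|"), nil
}

func main() {
	q, err := resolveQuery([]byte(`{"queries":["stardust","skyvault"]}`))
	fmt.Println(q, err)
}
```

Joining without extra grouping keeps the resulting pattern valid on every backend mentioned in the prompt template, since it only relies on plain `|` alternation.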
2026-06-04 13:30:32 +08:00 · 2026-05-20 19:32:37 +08:00
parent 31f560ecf1
commit f3c7281f47
7 changed files with 69 additions and 99 deletions
--- a/config/prompt_templates/agent_system_prompt.yaml
+++ b/config/prompt_templates/agent_system_prompt.yaml
@@ -128,7 +128,7 @@ templates:

      ### Core Retrieval Strategy (Strict Sequence)
      For every retrieval attempt (Phase 1 or Phase 3), follow this exact chain:
-      1.  **Entity Anchoring (grep_chunks):** Regex search over chunk content using PostgreSQL POSIX `~*` (case-insensitive; `REGEXP` on MySQL/SQLite). STRONGLY PREFER using regex to search for multiple concepts at once — pack 2-3 terms into one alternation query (e.g. `stardust|skyvault|psionic`) rather than firing several single-keyword calls. Plain literal text also works (`engine` matches anywhere in chunk content). Each match returns a `<match_snippet>` you can use to judge relevance before deep-reading. Input field is `queries` (array, 1-5).
+      1.  **Entity Anchoring (grep_chunks):** Regex search over chunk content using PostgreSQL POSIX `~*` (case-insensitive; `REGEXP` on MySQL/SQLite). Behaves like `grep -E -i`. Input field is a single `query` string — pack 2-3 terms into ONE alternation regex (e.g. `stardust|skyvault|psionic`) rather than firing several calls. Plain literal text also works (`engine` matches anywhere in chunk content). Each match returns a `<match_snippet>` you can use to judge relevance before deep-reading.
      2.  **Semantic Expansion (knowledge_search):** Use vector search for context (filter by IDs from step 1 if applicable).
      3.  **Deep Contextualization (list_knowledge_chunks): MANDATORY.**
          *   Rule: After Step 1 or 2 returns knowledge_ids, you MUST call this tool.
@@ -137,7 +137,7 @@ templates:
      5.  **Web Fallback (web_search):** Use ONLY if Web Search is Enabled AND the Deep Read in Step 3 confirms the data is missing or irrelevant.

      ### Tool Selection Guidelines
-      *   **grep_chunks / knowledge_search:** Your "Index". Use these to find *where* the information might be. `grep_chunks` uses PostgreSQL POSIX `~*` regex (case-insensitive; `REGEXP` on MySQL/SQLite) — input field is `queries` (1–5 regex strings); STRONGLY PREFER one alternation query (`a|b|c`) over multiple single-keyword calls. Literal text like `engine` also works. `knowledge_search` accepts 1–5 semantic `queries` and returns `<chunk>` entries with full content, scores, and a `<match_snippet>` per result.
+      *   **grep_chunks / knowledge_search:** Your "Index". Use these to find *where* the information might be. `grep_chunks` uses PostgreSQL POSIX `~*` regex (case-insensitive; `REGEXP` on MySQL/SQLite) — input is a single `query` string; pack synonyms into ONE alternation regex (`a|b|c`) instead of multiple calls. Literal text like `engine` also works. `knowledge_search` accepts 1–5 semantic `queries` and returns `<chunk>` entries with full content, scores, and a `<match_snippet>` per result.
      *   **list_knowledge_chunks:** Your "Eyes". MUST be used after every search. Use to read what the information is.
      *   **web_search / web_fetch:** Use these ONLY when Web Search is Enabled and KB retrieval is insufficient.
      *   **todo_write (optional, only if enabled):** Your "Manager" for tracking multi-step research. Only use it when the user has added it to the tool list.
@@ -492,7 +492,7 @@ templates:
      * **wiki_search:** Wiki entry point. Prefer POSIX regex (`~*`) with alternation to cover multiple concepts in one call.
      * **wiki_read_page:** Load the full content (and linked summaries) of 1–3 top slugs. Batch multiple slugs in a single call when possible.
      * **knowledge_search:** Semantic retrieval over raw chunks. Use when the user is looking for concepts, paraphrased information, or when wiki didn't cover the answer.
-      * **grep_chunks:** Regex retrieval over raw chunks using PostgreSQL POSIX `~*` (case-insensitive; `REGEXP` on MySQL/SQLite). Input field is `queries` (1–5 regex strings). STRONGLY PREFER using regex to search for multiple concepts at once — one alternation query `error_code_a|error_code_b` beats multiple single-keyword calls. Literal text also works. Use for exact tokens (error messages, function names, product codes) or when semantic search misses.
+      * **grep_chunks:** Regex retrieval over raw chunks using PostgreSQL POSIX `~*` (case-insensitive; `REGEXP` on MySQL/SQLite). Input is a single `query` string. Pack multiple concepts into ONE alternation regex (`error_code_a|error_code_b`) — do not split into multiple calls. Literal text also works. Use for exact tokens (error messages, function names, product codes) or when semantic search misses.
      * **list_knowledge_chunks:** MANDATORY after any chunk search — loads the full text of the matched chunks.
      * **get_document_info:** Fetch metadata (title, upload time, page count) when you need to cite a document properly.
      * **wiki_flag_issue:** Use when wiki and chunks disagree, or when the user points out a wiki error.
--- a/frontend/src/i18n/locales/en-US.ts
+++ b/frontend/src/i18n/locales/en-US.ts
@@ -3889,6 +3889,8 @@ export default {
      searchKbFailed: 'Knowledge base search failed',
      webSearch: 'Web search',
      webSearchFailed: 'Web search failed',
+      grepSearch: 'Keyword search',
+      grepSearchFailed: 'Keyword search failed',
      getDocInfo: 'Getting document info',
      getDocInfoFailed: 'Failed to get document info',
      thinkingDone: 'Thinking complete',
--- a/frontend/src/i18n/locales/ko-KR.ts
+++ b/frontend/src/i18n/locales/ko-KR.ts
@@ -3952,6 +3952,8 @@ export default {
      searchKbFailed: '지식베이스 검색 실패',
      webSearch: '웹 검색',
      webSearchFailed: '웹 검색 실패',
+      grepSearch: '키워드 검색',
+      grepSearchFailed: '키워드 검색 실패',
      getDocInfo: '문서 정보 조회',
      getDocInfoFailed: '문서 정보 조회 실패',
      thinkingDone: '사고 완료',
--- a/frontend/src/i18n/locales/ru-RU.ts
+++ b/frontend/src/i18n/locales/ru-RU.ts
@@ -3527,6 +3527,8 @@ export default {
      searchKbFailed: 'Ошибка поиска по базе знаний',
      webSearch: 'Веб-поиск',
      webSearchFailed: 'Ошибка веб-поиска',
+      grepSearch: 'Поиск по ключевым словам',
+      grepSearchFailed: 'Ошибка поиска по ключевым словам',
      getDocInfo: 'Получение информации о документе',
      getDocInfoFailed: 'Ошибка получения информации о документе',
      thinkingDone: 'Размышление завершено',
--- a/frontend/src/i18n/locales/zh-CN.ts
+++ b/frontend/src/i18n/locales/zh-CN.ts
@@ -3723,7 +3723,7 @@ export default {
  tools: {
    multiKbSearch: "跨库搜索",
    knowledgeSearch: "知识库搜索",
-    grepChunks: "文本模式搜索",
+    grepChunks: "搜索关键词",
    getChunkDetail: "获取片段详情",
    listKnowledgeChunks: "查看知识分块",
    listKnowledgeBases: "列出知识库",
@@ -3824,7 +3824,7 @@ export default {
    },
    tools: {
      searchKnowledge: "知识库检索",
-      grepChunks: "文本模式搜索",
+      grepChunks: "搜索关键词",
      webSearch: "网络搜索",
      webFetch: "网页抓取",
      getDocumentInfo: "获取文档信息",
@@ -3884,6 +3884,8 @@ export default {
      searchKbFailed: "检索知识库失败",
      webSearch: "网络搜索",
      webSearchFailed: "网络搜索失败",
+      grepSearch: "搜索关键词",
+      grepSearchFailed: "搜索关键词失败",
      getDocInfo: "获取文档信息",
      getDocInfoFailed: "获取文档信息失败",
      thinkingDone: "完成思考",
--- a/frontend/src/views/chat/components/AgentStreamDisplay.vue
+++ b/frontend/src/views/chat/components/AgentStreamDisplay.vue
@@ -2200,14 +2200,22 @@ const getToolTitle = (event: any): string => {
    // Try to get patterns from arguments or tool_data
    let patterns: string[] = [];
    if (event.arguments && typeof event.arguments === 'object') {
-      if (Array.isArray(event.arguments.patterns)) {
+      if (Array.isArray(event.arguments.queries)) {
+        patterns = event.arguments.queries;
+      } else if (Array.isArray(event.arguments.patterns)) {
        patterns = event.arguments.patterns;
+      } else if (event.arguments.query) {
+        patterns = [event.arguments.query];
      } else if (event.arguments.pattern) {
        patterns = [event.arguments.pattern];
      }
    } else if (event.tool_data) {
-      if (Array.isArray(event.tool_data.patterns)) {
+      if (Array.isArray(event.tool_data.queries)) {
+        patterns = event.tool_data.queries;
+      } else if (Array.isArray(event.tool_data.patterns)) {
        patterns = event.tool_data.patterns;
+      } else if (event.tool_data.query) {
+        patterns = [event.tool_data.query];
      } else if (event.tool_data.pattern) {
        patterns = [event.tool_data.pattern];
      }
@@ -2244,6 +2252,8 @@ const getToolDescription = (event: any): string => {
    return success ? t('agentStream.toolStatus.searchKb') : t('agentStream.toolStatus.searchKbFailed');
  } else if (toolName === 'web_search') {
    return success ? t('agentStream.toolStatus.webSearch') : t('agentStream.toolStatus.webSearchFailed');
+  } else if (toolName === 'grep_chunks') {
+    return success ? t('agentStream.toolStatus.grepSearch') : t('agentStream.toolStatus.grepSearchFailed');
  } else if (toolName === 'get_document_info') {
    return success ? t('agentStream.toolStatus.getDocInfo') : t('agentStream.toolStatus.getDocInfoFailed');
  } else if (toolName === 'thinking') {
--- a/internal/agent/tools/grep_chunks.go
+++ b/internal/agent/tools/grep_chunks.go
@@ -18,12 +18,12 @@ import (

 var grepChunksTool = BaseTool{
 	name: ToolGrepChunks,
-	description: `Search knowledge base chunk content using PostgreSQL POSIX regular expressions (~* operator, case-insensitive; REGEXP on MySQL/SQLite).
-STRONGLY PREFER using regex to search for multiple concepts at once rather than simple plain text queries.
-Returns matching chunks with per-pattern hit counts and a <match_snippet> around the first match (each tagged with its knowledge_id and chunk_id).
+	description: `Search knowledge base chunk content with a single POSIX regular expression, applied directly in the database (PostgreSQL ~* / MySQL/SQLite REGEXP, case-insensitive). Behaves like ` + "`grep -E -i`" + `.
+Pack multiple concepts into ONE regex using ` + "`|`" + ` alternation — do not call this tool repeatedly for synonyms.
+Returns matching chunks with hit counts and a <match_snippet> around the first match (each tagged with its knowledge_id and chunk_id).
 Examples:
- Alternation (RECOMMENDED): "stardust|skyvault" (matches either word)
- Multiple terms (RECOMMENDED): "psionic.*engine" (matches both words in order)
+- Alternation (RECOMMENDED): "stardust|skyvault|psionic" (matches any of the words)
+- Multiple terms in order: "psionic.*engine" (matches both words in order)
 - Word boundary / anchor: "\\brag\\b" or "^chapter\\s+\\d+"
 - Plain text: "engine" (matches literal substring anywhere in chunk content)
 IMPORTANT — JSON escaping: every backslash in a regex MUST be written as \\ inside the JSON tool arguments (e.g. to search for literal "C++" write "C\\+\\+", NOT "C\+\+"; for "\d+" write "\\d+"). Plain "\+" / "\d" etc. are invalid JSON escapes and will fail to parse.
@@ -31,40 +31,24 @@ Use this to locate candidate chunks by exact identifiers, error codes, product n
 	schema: json.RawMessage(`{
  "type": "object",
  "properties": {
-    "queries": {
-      "type": "array",
-      "items": { "type": "string" },
-      "description": "List of regex queries to run. A chunk matches when ANY query matches its content. Prefer one alternation query (\"a|b|c\") over multiple single-keyword queries.",
-      "minItems": 1,
-      "maxItems": 5
-    },
-    "knowledge_base_ids": {
-      "type": "array",
-      "items": { "type": "string" },
-      "description": "Optional: restrict search to specific KB IDs within the agent scope."
-    },
-    "limit": {
-      "type": "integer",
-      "description": "Max matching chunks to return (default 30, max 100).",
-      "default": 30,
-      "minimum": 1,
-      "maximum": 100
+    "query": {
+      "type": "string",
+      "description": "A single POSIX regex applied directly to chunk content (case-insensitive). Combine multiple concepts with \"|\" alternation in ONE regex (e.g. \"stardust|skyvault|psionic\") — do not split into multiple calls.",
+      "minLength": 1
    }
  },
-  "required": ["queries"]
+  "required": ["query"]
 }`),
 }

 // GrepChunksInput defines the input parameters for grep chunks tool.
-// The canonical parameter names are `queries` and `limit` (mirroring
-// wiki_search). The legacy `patterns` and `max_results` keys remain accepted
-// so older model outputs or external callers don't break silently.
+// The canonical parameter is a single `query` string (a regex with optional
+// `|` alternation), matching real `grep -E` semantics. Legacy array forms
+// (`queries`, `patterns`) and the singular `pattern` alias remain accepted
+// so older model outputs or external callers don't break silently — they
+// are joined together into a single alternation regex before execution.
 type GrepChunksInput struct {
-	Queries          []string `json:"queries,omitempty"`
-	Patterns         []string `json:"patterns,omitempty"` // legacy alias for queries
-	KnowledgeBaseIDs []string `json:"knowledge_base_ids,omitempty"`
-	Limit            int      `json:"limit,omitempty"`
-	MaxResults       int      `json:"max_results,omitempty"` // legacy alias for limit
+	Query string `json:"query,omitempty"`
 }

 // GrepChunksTool performs regex pattern matching across knowledge base chunks.
@@ -80,8 +64,8 @@ type GrepChunksTool struct {
 	db            *gorm.DB
 	searchTargets types.SearchTargets

-	mu          sync.Mutex
-	seenChunks  map[string]bool
+	mu         sync.Mutex
+	seenChunks map[string]bool
 }

 // NewGrepChunksTool creates a new grep chunks tool
@@ -107,58 +91,38 @@ func (t *GrepChunksTool) Execute(ctx context.Context, args json.RawMessage) (*ty
 		}, err
 	}

-	// Accept both canonical (`queries`) and legacy (`patterns`) field names.
-	// When both are present we concatenate, preserving whichever came first,
-	// so a caller migrating between the two won't end up with nothing.
-	rawQueries := append([]string{}, input.Queries...)
-	rawQueries = append(rawQueries, input.Patterns...)
+	// Resolve the canonical single-string `query`, falling back to legacy
+	// aliases. Legacy array inputs are joined with `|` so they degrade into
+	// a single alternation regex — preserving the previous "match ANY"
+	// semantics without requiring multiple DB scans.
+	query := strings.TrimSpace(input.Query)

-	queries := make([]string, 0, len(rawQueries))
-	for _, q := range rawQueries {
-		if strings.TrimSpace(q) != "" {
-			queries = append(queries, q)
-		}
-	}
-
-	if len(queries) == 0 {
-		logger.Errorf(ctx, "[Tool][GrepChunks] Missing or empty queries parameter")
+	if query == "" {
+		logger.Errorf(ctx, "[Tool][GrepChunks] Missing or empty query parameter")
 		return &types.ToolResult{
 			Success: false,
-			Error:   "queries parameter is required and must contain at least one non-empty regex query",
-		}, fmt.Errorf("missing queries parameter")
-	}
-	if len(queries) > 5 {
-		queries = queries[:5]
+			Error:   "query parameter is required and must be a non-empty regex string",
+		}, fmt.Errorf("missing query parameter")
 	}

-	// Compile queries with (?i) prefix for case-insensitive Go-side matching.
+	// Compile with (?i) prefix for case-insensitive Go-side matching.
 	// Compilation also validates the regex syntax before we send it to the DB.
-	compiled := make([]*regexp.Regexp, 0, len(queries))
-	for _, q := range queries {
-		re, err := regexp.Compile("(?i)" + q)
-		if err != nil {
-			logger.Errorf(ctx, "[Tool][GrepChunks] Invalid regex %q: %v", q, err)
-			return &types.ToolResult{
-				Success: false,
-				Error:   fmt.Sprintf("invalid regex query %q: %v", q, err),
-			}, err
-		}
-		compiled = append(compiled, re)
+	re, err := regexp.Compile("(?i)" + query)
+	if err != nil {
+		logger.Errorf(ctx, "[Tool][GrepChunks] Invalid regex %q: %v", query, err)
+		return &types.ToolResult{
+			Success: false,
+			Error:   fmt.Sprintf("invalid regex query %q: %v", query, err),
+		}, err
 	}
+	queries := []string{query}
+	compiled := []*regexp.Regexp{re}

-	// Canonical `limit`, with `max_results` accepted as legacy alias.
-	limit := input.Limit
-	if limit <= 0 && input.MaxResults > 0 {
-		limit = input.MaxResults
-	}
-	if limit <= 0 {
-		limit = 30
-	}
-	if limit > 100 {
-		limit = 100
-	}
+	// Result count is controlled by the backend, not the caller — keep it
+	// bounded so the LLM context stays small regardless of regex breadth.
+	const limit = 30

-	allowedKBIDs := t.searchTargets.GetAllKnowledgeBaseIDs()
+	kbIDs := t.searchTargets.GetAllKnowledgeBaseIDs()
 	kbTenantMap := t.searchTargets.GetKBTenantMap()

 	var allowedKnowledgeIDs []string
@@ -168,19 +132,6 @@ func (t *GrepChunksTool) Execute(ctx context.Context, args json.RawMessage) (*ty
 		}
 	}

-	kbIDs := input.KnowledgeBaseIDs
-	if len(kbIDs) == 0 {
-		kbIDs = allowedKBIDs
-	} else {
-		validKBs := make([]string, 0)
-		for _, kbID := range kbIDs {
-			if t.searchTargets.ContainsKB(kbID) {
-				validKBs = append(validKBs, kbID)
-			}
-		}
-		kbIDs = validKBs
-	}
-
 	logger.Infof(ctx, "[Tool][GrepChunks] Queries: %v, Limit: %d, KBs: %v, KnowledgeIDs: %v",
 		queries, limit, kbIDs, allowedKnowledgeIDs)

@@ -243,8 +194,9 @@ func (t *GrepChunksTool) Execute(ctx context.Context, args json.RawMessage) (*ty
 		Success: true,
 		Output:  output,
 		Data: map[string]interface{}{
-			"queries":            queries,
-			"patterns":           queries, // legacy alias; frontend currently reads `patterns`
+			"query":              query,
+			"queries":            queries, // legacy alias for older frontends
+			"patterns":           queries, // legacy alias for older frontends
 			"knowledge_results":  aggregatedResults,
 			"result_count":       len(aggregatedResults),
 			"total_matches":      len(finalResults),