Commit Graph

1936 Commits

Author SHA1 Message Date
mileslai
8ffd1ee2d6 fix(mcp-server): restore all create_session parameters (kb_id, max_rounds, enable_rewrite, fallback_response, summary_model_id)
Restore parameters that were inadvertently removed during refactoring.
- kb_id: Required knowledge base ID (architectural shift from KB-agnostic back to KB-bound sessions)
- max_rounds, enable_rewrite, fallback_response: Session strategy configuration
- summary_model_id: Model for response summarization
- title, description: Optional session metadata

These parameters enable AI agents to fully configure session behavior.
2026-05-29 16:40:29 +08:00
mileslai
e9a242c25f feat(mcp-server): add multi-transport support (stdio / SSE / HTTP) 2026-05-29 16:40:29 +08:00
yaol
00af694c52 feat(mcp): expose read-only wiki tools via Python MCP server
Add 3 read-only wiki tools (wiki_search, wiki_read_page,
wiki_index_view) to the Python MCP server, enabling external agents
like Claude Code and Codex to query WeKnora's LLM-generated wiki
pages following the LLM Wiki pattern.

Closes #1501

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 16:37:12 +08:00
ochan.kwon
40b74e2efa feat(retriever): activate OpenSearch k-NN driver (PR 3 of 3)
Phase 3 (#1440) gate flip. PR 1 (#1445) + PR 2a (#1481) + PR 2b (#1482)
laid the type prep + driver skeleton + read/write paths as gated dead
code; this PR wires every activation surface so opensearch becomes a
registerable VectorStore engine.

Activation wiring
- internal/types: validEngineTypes / GetVectorStoreTypes (with HNSW
  bounds + knn_engine enum + Immutable hints) / retrieverEngineMapping /
  buildEnvStoreForDriver — every gated surface now recognises
  "opensearch". IndexConfig grows four omitempty HNSW fields (HNSWM /
  HNSWEFConstruction / HNSWEFSearch / KNNEngine), keeping other engines'
  serialised config byte-identical.
- internal/container: createOpenSearchEngine + the switch case in
  createEngineServiceFromStore; the RETRIEVE_DRIVER=opensearch env path
  in initRetrieveEngineRegistry; NewEngineFactory now closes over the
  AuditLogService (the EngineFactory type itself is unchanged).
- internal/application/service/vectorstore_healthcheck.go: a
  testOpenSearchConnection case so CreateStore's connectivity probe
  accepts opensearch instead of returning 400.
- internal/application/repository/retriever/opensearch/transport.go:
  NewOpenSearchClient is exported so the factory and env path can build
  the TLS-hardened client; healthcheck.go reuses the unexported
  probeVersion / probeKNNPlugin for the service-layer probe.

Service-layer validation
- validateOpenSearchIndexConfig validates the HNSW caps (m 2-100,
  ef_construction 2-4096, ef_search 1-10000, knn_engine ∈ lucene|faiss).
  Shards/replicas continue to be enforced by the flat ValidateIndexConfig.
  Create-only: UpdateStore mutates the name only.
- validateConnectionConfig requires addr for opensearch.
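
The HNSW bounds listed above can be sketched as a small validator. This is a Python illustration only: the real validateOpenSearchIndexConfig is Go code whose signature and error format are not shown in this message, so the function name, the dict keys, and the rule that unset fields are skipped are all assumptions.

```python
# Illustrative only: the bounds come from the commit message, the names are hypothetical.
HNSW_BOUNDS = {
    "hnsw_m": (2, 100),
    "hnsw_ef_construction": (2, 4096),
    "hnsw_ef_search": (1, 10000),
}
KNN_ENGINES = {"lucene", "faiss"}


def validate_opensearch_index_config(cfg: dict) -> list[str]:
    """Return a list of human-readable violations (empty when cfg is valid)."""
    errors = []
    for field, (lo, hi) in HNSW_BOUNDS.items():
        value = cfg.get(field)
        if value is None:  # assumption: unset fields fall back to engine defaults
            continue
        if not lo <= value <= hi:
            errors.append(f"{field}={value} outside [{lo}, {hi}]")
    engine = cfg.get("knn_engine")
    if engine is not None and engine not in KNN_ENGINES:
        errors.append(f"knn_engine={engine!r} not in {sorted(KNN_ENGINES)}")
    return errors
```

Collecting every violation instead of failing on the first one is a design choice of this sketch, not something the commit states.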

Sync implementations (stubs.go shrinks)
- CopyIndices (copy.go) mirrors the Elasticsearch / Qdrant pattern —
  search → BatchSave with the source_id remap for generated questions —
  so dim/keyword routing and the source_id contract come from BatchSave
  for free. embeddingMap is keyed by the *target* SourceID because
  OpenSearch's BatchSave looks up embeddings by SourceID
  (lookupEmbedding), not by chunk_id (the ES driver's convention).
  Pagination is from/size; copies larger than max_result_window
  (default 10000) need the scroll-based async path that lands later.
- BatchUpdateChunkEnabledStatus / BatchUpdateChunkTagID (bulk_update.go)
  group the input by target value and issue one _update_by_query per
  group over the cross-dim <base>_* pattern. Caller values flow through
  bound script params only — never string-interpolated into the Painless
  source — closing the script-injection surface.
- inspectByQueryResponse (byquery.go) mirrors inspectBulkResponse: the
  full failure reason goes to the debug log only; the returned error
  carries the bounded id + type.
- UpdateByQueryParams.Refresh is *bool in opensearch-go v4.6.0 (the same
  shape as DeleteByQuery's quirk), so refresh=wait_for is not
  expressible; we use refresh=true.

Driver-owned audit (DIP)
- A new opensearch.AuditSink interface (with nopSink + WithAuditSink
  functional option) lets the driver emit opensearch.index_created and
  opensearch.reindex_executed events without importing any service
  package — the service layer implements the interface. NewRepository
  takes opts, so existing 4-arg test call sites keep compiling unchanged.
- internal/container/audit_sink.go bridges AuditSink to AuditLogService.
  When the context carries no tenant (the env-path registration ctx
  during boot, for example) the adapter skips the emit with a warning
  rather than silently writing tenant_id=0, which would collide with the
  system-scope sentinel.

Frontend + polish
- FieldSchema (frontend/src/api/vector-store.ts) gains min/max/enum/
  immutable. VectorStoreSettings.vue is now schema-driven: a closed
  `enum` renders a t-select; number inputs use the schema's `:min`/`:max`
  and fall back to the legacy replica-vs-shard heuristic only when the
  schema does not pin them; a danger-coloured warning fires when
  insecure_skip_verify is toggled on (the switch and warning are wrapped
  in a vertical stack so the warning sits on its own row below the switch).
- i18n: labels for hnsw_m / hnsw_ef_construction / hnsw_ef_search /
  knn_engine / insecure_skip_verify plus the warning copy in en-US,
  ko-KR, zh-CN, ru-RU.
- docker-compose.dev.yml: an opensearch profile (single-node 3.3.2 with
  security plugin disabled for dev only). OpenSearch Dashboards lives in a
  separate, opt-in opensearch-ui profile so the heavy UI container is not
  forced up alongside the cluster (the driver e2e is fully curl-verifiable
  against :9200). The new docs/dev/opensearch-integration-test.md covers the
  end-to-end exercise and the single-node guidance (set replicas=0 to keep
  the cluster Green).

Gating-guard tests flipped
- The "OpenSearch is NOT in validEngineTypes / mapping / types list /
  env builder / stubs" guard tests from PR 1 / PR 2 are replaced by
  their positive counterparts in this PR. The test suite was the
  activation checklist; the activation flip is its diff.

Backward compatibility
- Additive everywhere. IndexConfig's new HNSW fields are omitempty so
  other engines' serialised config is byte-identical. Existing
  Elasticsearch / Qdrant / Milvus / Weaviate / Doris / TencentVectorDB
  stores are untouched. No migrations.

Test plan
- go build ./... clean
- go vet ./... clean
- gofmt -l clean on touched files
- go test ./... — only TestOssEnsureBucket_CreateFails (Aliyun OSS
  endpoint), the docreader gRPC tests, and the doris SQL-shape tests
  fail; all three are pre-existing on upstream/main and untouched by
  this PR.
- New tests across internal/types, opensearch, service and container —
  including a full end-to-end env-path test that exercises
  initRetrieveEngineRegistry with RETRIEVE_DRIVER=opensearch against an
  httptest cluster.
2026-05-29 16:32:27 +08:00
wizardchen
19a1b15106 fix(container): correct resetPendingTasks SQL on startup
Split knowledge list/update queries to avoid GORM UPDATE...FROM
duplicate-table errors after Find, and use sync_logs started_at/
finished_at column names instead of start_time/end_time.
2026-05-29 15:46:11 +08:00
wizardchen
7d1dfc78b7 fix(knowledgeBase): improve tag loading logic and ensure consistent behavior
- Updated the loadTags function to skip the call when tags are already loading, avoiding redundant requests.
- Modified tag loading calls in the tag-related functions to always pass reset=true, so the tag list is refreshed correctly after create, edit, and delete operations.
- Improved the FAQEntryManager component to handle tag loading more efficiently during scrolling and batch operations.
2026-05-29 15:42:44 +08:00
wizardchen
c07ab6988c feat(frontend): enhance IMChannelsOverviewPanel layout and functionality
This update improves the layout and user experience of the IMChannelsOverviewPanel component. Key changes include:

- Added tooltips for subtitles and agent names for better accessibility.
- Refactored channel and agent display logic to improve clarity and consistency.
- Adjusted styling for better visual hierarchy and responsiveness.
- Enhanced toggle functionality for IM channels to ensure state consistency during updates.

These changes aim to provide a more intuitive interface for users managing instant messaging channels.
2026-05-29 15:40:38 +08:00
wizardchen
bb202a203a feat: implement tenant validation for file access and enhance related tests
This commit introduces a new validation mechanism to ensure that file access paths include the correct tenant segment, preventing cross-tenant access. The `ValidateStoragePathTenant` function has been added to enforce this rule, and the `serveFiles` function has been updated to return a forbidden status for invalid paths. Additionally, new tests have been added to verify the behavior of the file service under various tenant scenarios, ensuring robust handling of file access permissions.
2026-05-29 13:47:32 +08:00
wizardchen
48381fbaf5 feat: implement tenant default storage provider handling in Knowledge Base creation
This commit introduces functionality to utilize the tenant's default storage provider when creating a Knowledge Base. It includes updates to the frontend to load the default provider from settings and apply it during Knowledge Base initialization. Additionally, the backend has been enhanced to ensure that the storage provider is set correctly based on tenant configuration, improving consistency across the application. Tests have been added to verify the correct application of the default storage provider in various scenarios.
2026-05-29 13:47:32 +08:00
wizardchen
e0823eff31 feat: enhance IM file service tests and refactor storage URL handling
This commit introduces a new test suite for the IM file service, including a stub implementation for testing purposes. It adds tests for resolving file services based on storage providers and ensures proper fallback mechanisms for MinIO URLs. Additionally, the `rewriteStorageURLs` and `cleanIMContent` functions have been refactored to utilize a resolver for improved caching and efficiency. These changes enhance the robustness of file service handling and improve test coverage for various storage scenarios.
2026-05-29 13:47:32 +08:00
wizardchen
bdb164d432 feat: add tests for incomplete Markdown image handling and improve stream flush logic
This commit introduces a new test, `TestFindIncompleteMarkdownImage`, to validate the detection of incomplete Markdown images in various scenarios. Additionally, it enhances the `holdbackCutoff` function to prioritize handling incomplete Markdown images, ensuring that they are correctly managed during stream flush operations. The changes improve the robustness of image processing in the application, addressing potential issues with unclosed image URLs in Markdown content.
2026-05-29 13:47:32 +08:00
wizardchen
5d02404fdd feat: add unit tests for COS object name parsing and enhance error handling
This commit introduces unit tests for the `parseCosObjectName` method in the `cosFileService`, ensuring it correctly rejects local scheme URLs and properly parses COS scheme URLs. Additionally, the `parseCosObjectName` method has been updated to return an error for unsupported schemes, improving error handling in the `GetFile` and `DeleteFile` methods. This enhancement ensures more robust handling of file paths in the application.
2026-05-29 13:47:32 +08:00
wizardchen
83808cc5b7 feat(frontend): improve agent and KB editor ID display and intent prompts UX
Expose copyable resource IDs in edit modals and replace the intent prompt
dropdown with independent toggle buttons so multi-intent selection wraps cleanly.
2026-05-28 21:11:49 +08:00
wizardchen
0e1282c2da test: add document process task options tests
Centralize DefaultDocumentProcessTimeout in the config package and
reuse DocumentProcessTimeout() from documentProcessTaskOptions.
2026-05-28 20:27:53 +08:00
HelloWeit
edfabf3135 bugfix 2026-05-28 20:22:55 +08:00
wizardchen
f1a27e0e18 feat: implement CancelOpenSpansByName method and related tests
This commit introduces the CancelOpenSpansByName method in the KnowledgeSpanRepository, allowing for the cancellation of open spans by their name for a specific knowledge ID and attempt. This functionality is crucial for managing spans during retries or server restarts, preventing duplicate entries in the trace tree. Additionally, a new test case, TestKnowledgeSpanRepo_CancelOpenSpansByName, has been added to ensure the correct behavior of this method, verifying that only the intended spans are cancelled while others remain unaffected. This enhancement improves the robustness of span management in the application.
2026-05-28 20:16:02 +08:00
wizardchen
e3525f884b refactor: update terminology and improve clarity in knowledge parsing documentation and UI
This commit refines the language used in the knowledge parsing documentation and user interface. Key changes include:

- Updated the description of the `finalizing` state to clarify that it refers to ongoing optimization tasks rather than just completion.
- Modified the confirmation message for canceling parsing to replace "enhancement" with "optimization" for consistency across multiple languages.
- Enhanced the UI to better reflect the current parsing status, including a new function to display appropriate status messages during in-flight parsing.

These changes aim to improve user understanding and experience when interacting with the knowledge parsing features.
2026-05-28 20:16:02 +08:00
wizardchen
44d6175559 feat: add knowledge parse cancellation with finalizing post-process state
Lets users stop an in-flight document parse to free up LLM / worker
resources without losing the chunks and index already written. The
core insight is that the previous parse_status=completed flipped as
soon as primary chunks landed, while the most expensive subtasks
(graph extract = N LLM calls per chunk, plus summary, question
generation) were still running in the background — so "completed"
wasn't actually terminal from a resource standpoint.

State machine

  pending -> processing -> finalizing -> completed
                              |
                              +-> cancelled (any of the three
                                            in-flight states)
                              +-> failed
                              +-> deleting

`finalizing` is the new post-process fan-out window. parse_status
only promotes to `completed` once pending_subtasks_count (a new
column tracking summary + question + per-chunk graph extract)
drains to zero via atomic FinalizeSubtask. Wiki ingest is
intentionally excluded from the counter — it's a KB-scoped
debounced batch and would otherwise pin parse_status in
`finalizing` for the wiki batch window.

Backend

- New ParseStatusFinalizing + pending_subtasks_count column with
  migration 000056.
- knowledgeRepository.SetFinalizing transitions processing -> finalizing
  conditionally so a racing cancel cannot be clobbered.
- knowledgeRepository.FinalizeSubtask atomically decrements the
  counter and self-promotes the row to completed when it hits zero.
- KnowledgePostProcess restructured to compute expected subtask
  count up front, flip to finalizing (or completed when no
  enrichment is enabled), and only then fan out subtasks. Subtask
  handlers (summary, question, graph extract) defer-decrement on
  terminal exit using the existing isFinalAsynqAttempt convention.
- New POST /api/v1/knowledge/{id}/cancel-parse handler accepting
  pending / processing / finalizing. Marks the row cancelled,
  zeroes the counter, best-effort dequeues asynq tasks via a new
  TaskInspector abstraction (asynq-mode walks pending/scheduled/
  retry queues; Lite-mode noop), and scrubs wiki ingest pending op.
- SpanTracker.AbortAttempt flat-sweeps every still-running span
  for the attempt via a new repo.CancelAllOpenSpans helper so the
  trace viewer's striped bars all flip to cancelled, even leaf
  generations whose parent stage already EndSpan'd (multimodal
  fan-out pattern). knowledge_post_process closes its postSpan
  via SkipSpan on the cancel/deleting entry guard so a worker
  that opens a span AFTER the cancel sweep doesn't leak it.
- Housekeeping and resetPendingTasks sweep finalizing rows
  identically to processing so a crash/restart can't strand them.
- DeleteKnowledge/DeleteKnowledgeList proactively dequeue
  downstream tasks via the same TaskInspector path.
- ChunkExtractService gets a cancel entry guard so the most
  expensive enrichment (graph extract) bails immediately when the
  parent knowledge is aborted.
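
The atomic decrement-and-promote behaviour described for FinalizeSubtask can be approximated with a single conditional UPDATE. The sketch below uses Python with SQLite purely for illustration; the table name `knowledges`, the exact WHERE guards, and the function name are assumptions, and the real implementation is Go on the project's own database.

```python
import sqlite3


def finalize_subtask(conn: sqlite3.Connection, knowledge_id: str) -> None:
    """Decrement the counter and promote to 'completed' in one UPDATE statement."""
    conn.execute(
        """
        UPDATE knowledges
           SET pending_subtasks_count = pending_subtasks_count - 1,
               parse_status = CASE
                   WHEN pending_subtasks_count - 1 = 0 THEN 'completed'
                   ELSE parse_status
               END
         WHERE id = ?
           AND parse_status = 'finalizing'      -- a cancelled row is left untouched
           AND pending_subtasks_count > 0
        """,
        (knowledge_id,),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE knowledges (id TEXT PRIMARY KEY, parse_status TEXT, pending_subtasks_count INTEGER)"
)
conn.execute("INSERT INTO knowledges VALUES ('k1', 'finalizing', 2)")
finalize_subtask(conn, "k1")
after_first = conn.execute("SELECT parse_status, pending_subtasks_count FROM knowledges").fetchone()
finalize_subtask(conn, "k1")
after_second = conn.execute("SELECT parse_status, pending_subtasks_count FROM knowledges").fetchone()
```

Doing the decrement and the status promotion in one statement means two subtasks finishing at the same time cannot both read a stale counter, and the `parse_status = 'finalizing'` guard keeps a late subtask from reviving a cancelled row.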

Frontend

- New cancelKnowledgeParse API client + "Stop parsing" entry in
  both list view and card view more menus, gated on
  pending/processing/finalizing.
- Polling predicate refactored to a shared isParseInFlight helper
  that recognises `finalizing` (previously the doc list silently
  stopped polling once parse_status flipped from processing).
- Knowledge processing timeline: isPolling includes finalizing,
  new isHardTerminal short-circuits LIVE for cancelled/failed/
  completed so stranded child spans cannot pin LIVE on.
- DocumentListView.computeStatus distinguishes finalizing
  ("增强中", "Enhancing") from completed and shows the previous "生成摘要中"
  ("Generating summary") copy when summary_status is still pending under
  finalizing. Added a cancelled badge as well.
- i18n: statusFinalizing / statusCancelled / cancelParse* keys
  across zh-CN, en-US, ko-KR, ru-RU.
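
The two status predicates mentioned above (isParseInFlight and isHardTerminal) reduce to set membership. The following Python sketch mirrors that logic under the status names given in this message; the real helpers are frontend TypeScript, and the function names here are only transliterations.

```python
# Status names are taken from the commit message; the helpers are illustrative.
IN_FLIGHT = frozenset({"pending", "processing", "finalizing"})
HARD_TERMINAL = frozenset({"cancelled", "failed", "completed"})


def is_parse_in_flight(parse_status: str) -> bool:
    """True while the document list should keep polling and cancel is offered."""
    return parse_status in IN_FLIGHT


def is_hard_terminal(parse_status: str) -> bool:
    """True once the LIVE indicator must be off regardless of stranded child spans."""
    return parse_status in HARD_TERMINAL
```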

Docs / SDK

- docs/api/knowledge.md: documents the new finalizing state,
  cancel-parse semantics, and which statuses accept cancel.
- client (Go SDK): CancelKnowledgeParse with docstring listing
  the cancellable statuses.
2026-05-28 20:16:02 +08:00
nullkey
c29d36238b docs(cli): AGENTS + README + CHANGELOG for v0.8
AGENTS.md gains three sections for the v0.8 surfaces:
  - Stream recovery   — session continue-stream replay-from-0 semantics
                        and the dedupe contract agents must implement
  - Dry-run contract  — when --dry-run applies, the meta.{dry_run,plan}
                        envelope shape, exit-code semantics (no exit 10
                        on destructive + --dry-run), the GET-reject rule
                        for `weknora api`, and the validation-parity
                        guarantee with the live path
  - Risk metadata     — what the Risk: prefix in --help means and how
                        cobra.Annotations["risk.{level,action}"] are
                        populated

README.md gains user-facing Dry-run preview and Resuming streams
sections.

CHANGELOG.md adds the v0.8 entry covering the new --dry-run flag,
MCP Tool.Annotations, session continue-stream, and the Risk: line.
2026-05-28 19:29:50 +08:00
nullkey
1bae6b6b6c feat(cli): session continue-stream + NDJSON init MessageID
Adds `weknora session continue-stream <session-id> --message <id>` for
re-attaching to an in-progress or already-completed SSE event buffer.

Server semantics (replay-from-0 + tail):
  - Every connection replays the full stored event log from index 0,
    then tails any new events. NOT cursor-from-disconnect. Agents that
    already consumed events on the original stream MUST dedupe by
    message_id + event hash to avoid double-processing.
  - Buffer TTL: redis mode 1h hardcoded; memory mode = process lifetime.
    After expiry the CLI surfaces local.sse_stream_aborted.
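
Since every reconnect replays from index 0, a consumer needs a dedupe table keyed by message_id plus an event hash. A minimal Python sketch of that contract follows; the event shape, the choice of SHA-256 over canonical JSON, and the class name are assumptions, not part of the CLI.

```python
import hashlib
import json


def event_key(message_id: str, event: dict) -> tuple[str, str]:
    """Dedupe key: message_id plus a stable hash of the event payload."""
    canonical = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return message_id, hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ReplayDeduper:
    """Remembers which events were already processed across reconnects."""

    def __init__(self) -> None:
        self._seen: set[tuple[str, str]] = set()

    def is_new(self, message_id: str, event: dict) -> bool:
        key = event_key(message_id, event)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

One caveat of hashing the payload alone: if a stream can legitimately contain two byte-identical events, the second would be dropped, so a real consumer may want to fold a sequence index or event id into the key when one is available.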

Output is NDJSON: one CLI-injected init line carrying
{session_id, message_id, profile} at stream head, then raw SDK
StreamResponse events verbatim. The init line lets agents thread the
resume to the original message in their dedupe table before the first
SDK frame arrives — output.InitEvent gains an omitempty MessageID field
for this purpose; non-resume init events stay unchanged.

The command always emits NDJSON regardless of --format — there is no
human-text use case for raw event-log replay (operator scenarios are
incident response / debugging). --dry-run is excluded for the same
reason streaming commands always are: a buffered plan makes no sense
for an event stream.
2026-05-28 19:29:50 +08:00
nullkey
6d8c8650cd feat(cli): --dry-run + risk metadata + validation parity on 19 mutations
Two intertwined agent safety nets that share the same files:

1. --dry-run flag for offline preview of mutation commands
2. Risk: metadata + SetRisk helper for destructive command surfaces

Coverage (19 mutation commands with --dry-run):
  kb.create/edit/delete           agent.create/edit/delete
  doc.create/upload/fetch/delete  doc.delete_all (special variant)
  session.delete  chunk.delete    profile.add/remove
  auth.refresh/logout             link  unlink
  api.{post,put,patch,delete}     (api.get + --dry-run rejected, exit 2)

Envelope additions (omitempty in non-dry-run paths):
  meta.dry_run: bool        true when --dry-run was used
  meta.plan:    map         {action, args} per the per-command taxonomy

Risk: metadata
--------------
cmdutil.SetRisk(cmd, action) stamps cobra.Annotations with
risk.level=destructive + risk.action=<action> on the 9 destructive
commands. The SetAgentHelp wrapper prepends a "Risk: <action>
(destructive)" line in the default text help path so agents see a clear
warning before parsing Usage. WEKNORA_AGENT_HELP=1 JSON path stays
unchanged — structured agent-help already carries warnings[].

Validation parity with the live path
------------------------------------
Every pure-local validation (flag presence, mutual exclusion, enum
bounds, URL/regex format, ResolveKBLocal for KB resolution that does not
require an SDK call) runs BEFORE the dry-run gate. This matches the
industry-standard "preview shows what live would do" contract:
--dry-run accepts exactly the same invocations the live path accepts and
rejects exactly the same invocations the live path rejects, modulo the
side-effecting work itself.

The side-effecting work (SDK calls, file writes, keyring writes, server-
side name → id resolution) is what --dry-run actually gates. Each
mutation file pairs its RunE validation block with a regression test
under *_dry_run_test.go / dryrun_validation_test.go so future refactors
don't reintroduce the gap.
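
The ordering described above (local validation first, dry-run gate second, side effects last) can be sketched as follows. This is a Python illustration of the control flow only: the function name, the `data` key in the live result, and the injected `delete_fn` are hypothetical, while the meta.dry_run / meta.plan shape follows the envelope described in this message.

```python
def run_kb_delete(args: dict, dry_run: bool, delete_fn) -> dict:
    """Validate first, then either preview (dry-run) or perform the side effect."""
    # Pure-local validation runs before the dry-run gate, so both paths
    # reject exactly the same invocations.
    if not args.get("kb"):
        raise ValueError("--kb is required")

    if dry_run:
        # No SDK call: the plan reports the raw --kb value as given.
        return {
            "meta": {
                "dry_run": True,
                "plan": {"action": "kb.delete", "args": {"kb": args["kb"]}},
            }
        }

    delete_fn(args["kb"])  # side-effecting work, the only part --dry-run gates
    return {"data": {"deleted": args["kb"]}}  # hypothetical live-path result shape
```

Placing the gate after validation is what makes a successful preview meaningful: an invocation that previews cleanly cannot then fail on a flag error when run live.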

Helper surface
--------------
- HandleDryRun(cmd, dryRun, plan) extracts the early-return so the
  19 RunE call sites stay 3 lines each.
- EmitDryRun routes through FormatOptions.Emit, inheriting _notice /
  TTY indent / --jq filtering for free.
- ResolveKBLocal mirrors ResolveKB but never calls the SDK; dry-run
  paths use it so the plan reports the raw --kb value (UUID or name)
  without a name → id lookup.

Streaming commands (chat, session ask, session continue-stream) are
deliberately excluded: a buffered plan makes no sense for an event
stream.

Lock semantics in the dry-run path:
  - destructive + --dry-run: exit 0, no exit-10 confirmation prompt
  - --dry-run + -y: byte-identical envelope to --dry-run alone
  - --dry-run + --jq: filter applies to the preview envelope normally
2026-05-28 19:29:50 +08:00
nullkey
c11df51c79 feat(cli): MCP Tool.Annotations on 10 tools (spec 2025-06-18)
Bumps modelcontextprotocol/go-sdk to v1.6.1 and populates Tool.Annotations
on every registered MCP tool per the per-tool hint table:

  Read tools (8):   destructiveHint=false, readOnlyHint=true,
                    idempotentHint=true,   openWorldHint=false
  Invoke tools (2): destructiveHint=false, readOnlyHint=false,
                    idempotentHint=false,  openWorldHint=true

Invoke-class tools (chat, agent_invoke) carry openWorldHint=true because
the server may dispatch external skills (web_search etc.). Read tools are
sealed: idempotent + read-only + closed-world.

TestToolAnnotations_AllToolsHaveExpectedHints asserts the matrix so any
future drift surfaces in CI rather than at first client integration.
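
The hint matrix and the drift check it enables can be expressed compactly. The sketch below is Python rather than the Go test named above, and `find_drift` plus any tool names other than chat and agent_invoke are hypothetical.

```python
# Hint values are taken from the table in this commit message.
READ_HINTS = {
    "destructiveHint": False, "readOnlyHint": True,
    "idempotentHint": True, "openWorldHint": False,
}
INVOKE_HINTS = {
    "destructiveHint": False, "readOnlyHint": False,
    "idempotentHint": False, "openWorldHint": True,
}
INVOKE_TOOLS = {"chat", "agent_invoke"}


def expected_hints(tool_name: str) -> dict:
    """Invoke-class tools are open-world; every other tool is treated as a sealed read."""
    return INVOKE_HINTS if tool_name in INVOKE_TOOLS else READ_HINTS


def find_drift(registered: dict[str, dict]) -> list[str]:
    """Return the names of tools whose annotations differ from the matrix."""
    return sorted(name for name, hints in registered.items() if hints != expected_hints(name))
```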
2026-05-28 19:29:50 +08:00
wolfkill
7b2ef8bd8e fix: show agent id in editor 2026-05-28 19:22:54 +08:00
wolfkill
43565c5d1b fix: add model display name 2026-05-28 19:22:17 +08:00
wizardchen
4e58dd42cc feat(knowledge-base): integrate trace availability checks and enhance UI interactions
This commit introduces a new utility function, `knowledgeSpansPayloadHasTrace`, to determine if the knowledge spans data contains a valid trace. Key changes include:

1. Updated the `KnowledgeBase` component to utilize the new trace availability checks, improving the logic for displaying trace-related UI elements.
2. Enhanced the `fetchSpans` function in the `knowledge-processing-timeline` component to emit trace availability based on the new utility.
3. Implemented caching for trace availability to optimize performance and reduce unnecessary API calls.

These changes aim to improve user experience by providing accurate trace information and enhancing the overall responsiveness of the UI.
2026-05-28 15:26:53 +08:00
wizardchen
a0547729b2 feat(trace-drawer): implement resizable trace drawer and enhance UI interactions
This commit introduces a resizable trace drawer in the `doc-content` component, allowing users to adjust the width for better visibility. Key changes include:

1. Added functionality to save and load the drawer width from local storage.
2. Implemented mouse events for resizing the drawer, enhancing user interaction.
3. Updated the UI to reflect the new drawer width dynamically.
4. Enhanced the trace entry button for improved accessibility and clarity.

These changes aim to improve user experience by providing a more flexible and user-friendly interface for trace inspection.
2026-05-28 15:14:45 +08:00
wizardchen
d12273255d refactor(timeline): improve polling logic and documentation for gracePoll
This commit enhances the polling mechanism in the KnowledgeProcessingTimeline component by introducing a new function, `shouldPollNow`, which clarifies the conditions under which polling should occur based on the `gracePoll` prop. The documentation for `gracePoll` has been expanded to provide clearer semantics for both user-visible and background mounts. Additionally, minor formatting adjustments were made to improve code readability. These changes aim to streamline the polling behavior and enhance the overall user experience.
2026-05-28 15:14:45 +08:00
wizardchen
797f55c567 feat(timeline): enhance knowledge processing spans and UI elements
This commit introduces several improvements to the knowledge processing timeline and related components. Key changes include:

1. Added a `gracePoll` prop to the `KnowledgeProcessingTimeline` component to manage polling behavior more effectively.
2. Enhanced the UI by displaying the document title in the drawer, improving user visibility of the current document context.
3. Implemented new CSS classes for better styling of the drawer title bar, ensuring a more polished appearance.
4. Updated the backend to support the new `WikiSpan` tracking, allowing for detailed monitoring of document processing stages.

These changes aim to improve user experience and provide better insights into the document processing workflow.
2026-05-28 15:14:45 +08:00
wizardchen
e697eee07f feat(trace): enable direct access to trace drawer from card menu
This commit introduces a new method to open the trace drawer directly from the card menu, enhancing user experience by allowing immediate access to trace details without navigating through the document detail drawer. The implementation includes updates to the `handleViewTrace` function to ensure the correct knowledge ID and parse status are set before opening the trace drawer. Additionally, minor adjustments were made to the UI for better consistency and clarity.
2026-05-28 15:14:45 +08:00
wizardchen
df3b72c0fe feat(migrations): add knowledge_processing_spans table and rollback script
This commit introduces the `knowledge_processing_spans` table to track the progress of document parsing stages, enabling better visibility and error handling in the frontend. The schema includes fields for span details, status, and timestamps, along with necessary indexes for efficient querying. A rollback script is also provided to drop the table and its associated indexes if needed.
2026-05-28 15:14:45 +08:00
wizardchen
2a4d6f9019 fix: preserve span input/metadata, auto-expand subspans, force-arm poll
Three fixes in response to user feedback:

1. Span input disappearing on End/Fail
The Upsert's DoUpdates always listed input/output/metadata, so calls
that only set output (EndSpan) or only set error_* (FailSpan) wrote
NULL into input/metadata, clobbering whatever Begin had recorded.
Build the column list dynamically: skip input/output/metadata when
the incoming row's value is nil. nil now means "preserve existing"
(matches the user's intuition: "Begin recorded it, End shouldn't erase it").
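
The dynamic column list can be pictured with a short Python sketch. The always-updated column names below are hypothetical placeholders; only input/output/metadata and the "nil means preserve" rule come from this message, and the actual code builds a GORM DoUpdates list in Go.

```python
ALWAYS_UPDATED = ["status", "finished_at", "error_code", "error_message"]  # hypothetical names
PRESERVE_WHEN_NONE = ["input", "output", "metadata"]


def columns_to_update(row: dict) -> list[str]:
    """Columns for the upsert's update clause; None means 'preserve existing'."""
    cols = list(ALWAYS_UPDATED)
    cols += [c for c in PRESERVE_WHEN_NONE if row.get(c) is not None]
    return cols
```

With this rule an EndSpan-style call that sets only `output` never lists `input` or `metadata`, so whatever Begin recorded survives the upsert.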

2. Subspans not auto-expanded
Stages with children (multimodal.image[*], postprocess.summary,
postprocess.question, postprocess.graph.chunk[*]) required a click
on the ▸ caret to surface — easy to miss. On the FIRST successful
fetch per (knowledgeId × attempt), auto-expand any stage that has
children. Subsequent polls honor whatever the user has collapsed,
so manual collapse mid-parse stays collapsed.

3. Auto-poll still not firing
Force-arm the polling interval in onMounted regardless of state.
The per-tick callback decides whether to actually fetch based on
current parse_status — so the loop can never get stranded waiting
on a status transition that already happened. Added a console.debug
when the interval arms, so we can verify from DevTools console that
polling is actually running.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 15:14:45 +08:00
wizardchen
baa4e75d4a fix(timeline): switch to setInterval polling + watchdog watcher
User report: opened a parsing-in-progress doc, /spans was hit ONCE on
mount and then never again — but parse_status said 'processing' and
the LIVE badge was lit. The setTimeout-based recursive scheduler had
a fragile property: any unexpected throw or skipped tick between the
finally block and the schedule check could silently strand the loop
with no way to re-arm.

Refactor to a self-healing pattern:
- setInterval polling: a single 2s tick checks "should we be polling?
  and is no fetch in flight?" — if both, fire fetchSpans. Decision is
  re-evaluated every tick from current reactive state, so flipping
  status from 'processing' to 'completed' between ticks naturally
  stops further fetches.
- ensurePolling() helper is idempotent — calling it from fetchSpans's
  finally, from a parse_status watcher, and from onMounted all just
  arm the same single interval if not already running.
- watch(data.value.parse_status) acts as a watchdog: any time the
  status flips into 'pending'/'processing', re-arm. Any time it flips
  out, tear down. This is the belt-and-suspenders that ensures the
  loop can't get stranded even if a future regression re-introduces
  a missed-schedule path.
- fetchInFlight guard prevents overlapping fetches if a slow backend
  takes longer than POLL_INTERVAL_MS to respond.
- Console warns on fetch failure: silent rejections are exactly what
  hid this bug. With this in place we can verify the loop's health
  from the DevTools console.
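
The per-tick decision described in the list above can be modelled as follows. This is a synchronous Python sketch of the logic, not the component's TypeScript: the class and method names are assumptions, and in the real component the in-flight guard matters because fetches are asynchronous.

```python
class Poller:
    """Models one interval tick: re-evaluate state, guard overlap, never strand."""

    def __init__(self, fetch, get_status):
        self._fetch = fetch            # callable performing one /spans request
        self._get_status = get_status  # callable returning the current parse_status
        self.fetch_in_flight = False

    def should_poll(self) -> bool:
        return self._get_status() in ("pending", "processing")

    def tick(self) -> bool:
        """Run one tick; returns True when a fetch was fired."""
        if not self.should_poll() or self.fetch_in_flight:
            return False
        self.fetch_in_flight = True
        try:
            self._fetch()
        except Exception as exc:
            print(f"spans fetch failed: {exc}")  # surface failures instead of swallowing them
        finally:
            self.fetch_in_flight = False  # always cleared, so the next tick can fire
        return True
```

Because each tick decides from current state instead of relying on the previous fetch to schedule the next one, a thrown error or a skipped callback costs at most one interval.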

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 15:14:45 +08:00
wizardchen
babf9139ec fix(timeline): hide "更新于" once trace is terminal
User report: opening an already-completed doc still shows "更新于 X 秒前"
("Updated X seconds ago") ticking up forever, which is misleading because the
data is final and there is no freshness concern. Re-gate the caption on isLive
(it was always on) so it only appears while pending/processing. The failures
pill is still gated on the same isLive condition, which is the only state
where polling-failure feedback is meaningful.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 15:14:45 +08:00
wizardchen
9333869439 fix(timeline): track fetch attempts and surface silent polling failures
User reported: spinner spinning + "更新于 X 秒前" ("updated X seconds ago")
growing without bound while parsing is in progress. The auto-poll loop IS
firing every 2s, but lastFetchedAt was only updated on success, so when
the API returned success=false or errored, the caption silently aged
while the visible state suggested live polling.

- Move lastFetchedAt update into the finally block so it tracks
  attempts (success or failure). The caption now bounces back to
  "刚刚 / 1s 前 / 刚刚" ("just now / 1s ago / just now") every poll
  cycle, matching the spinner's visual cadence.
- Track failedAttempts and lastFetchOk: when the last attempt failed
  the timestamp shows in muted italic; when 2+ consecutive attempts
  fail a small "⚠ 刷新失败" ("refresh failed") pill appears next to it
  with a tooltip explaining the situation.
- New i18n keys fetchFailed / fetchFailedShort in zh-CN/en-US/ko-KR.
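
The bookkeeping above can be sketched as a small pure state update. lastFetchedAt, lastFetchOk and failedAttempts come from the message; the FetchHealth shape and the recordAttempt, captionStyle and showFailedPill helpers are hypothetical, and resetting the counter on success is an assumption drawn from the word "consecutive".

```typescript
interface FetchHealth {
  lastFetchedAt: number | null; // epoch ms of the last ATTEMPT, not last success
  lastFetchOk: boolean;
  failedAttempts: number; // consecutive failures
}

const initialHealth: FetchHealth = {
  lastFetchedAt: null,
  lastFetchOk: true,
  failedAttempts: 0,
};

// Intended to be called from the fetch's finally block, so it runs
// whether the attempt succeeded or failed.
function recordAttempt(prev: FetchHealth, ok: boolean, now: number): FetchHealth {
  return {
    lastFetchedAt: now,
    lastFetchOk: ok,
    failedAttempts: ok ? 0 : prev.failedAttempts + 1,
  };
}

// Muted italic after a single failure; the pill needs 2+ in a row.
function captionStyle(h: FetchHealth): "normal" | "muted-italic" {
  return h.lastFetchOk ? "normal" : "muted-italic";
}

function showFailedPill(h: FetchHealth): boolean {
  return h.failedAttempts >= 2;
}
```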

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 15:14:45 +08:00
wizardchen
c4fba77a96 refactor(timeline): move Trace entry to header pill, simplify refresh UX
The Trace trigger was a giant card-style row with a green left strip
sitting above 文件名/摘要 (file name / summary). It visually competed
with document content and felt out of place next to the rest of the
app's quiet density.

Trace entry:
- Replaces the big card with a compact rounded pill in the drawer
  header, next to the file-type t-tag.
- Reads as a quiet secondary action — status dot + label + duration
  + chevron — at the same visual weight as the file-type tag.
- Status dot still pulses while parsing is in progress so the LIVE
  signal survives the size cut.
- Hidden mount preserved so the pill's status/duration stays live
  even before the user opens the secondary drawer.

Refresh UX:
- "更新于 X 秒前" ("updated X seconds ago") caption now always visible
  (was hidden once polling stopped). After parsing completes it tells
  the user how stale the data is when they reopen the drawer hours
  later; the original reason it was hidden ("ticks up forever") is
  exactly the staleness signal one needs.
- LIVE badge tooltip simplified from "解析中,每 2 秒自动刷新({n}s 后)"
  ("Parsing, auto-refreshes every 2 seconds (in {n}s)", a cryptic
  countdown) to "解析进行中 — 每 2 秒自动刷新" ("Parsing in progress,
  auto-refreshes every 2 seconds").
- Manual refresh icon now slow-spins in brand color while auto-poll
  is active — users can tell at a glance "auto-refresh is on" without
  needing the LIVE badge to interpret. Manual click still triggers
  the fast spin.
- Drops the unused nextRefreshIn computed.

i18n: liveTooltip rewritten in zh-CN/en-US/ko-KR; new keys
autoRefreshOn / autoRefreshOff for the button title states.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 15:14:45 +08:00
wizardchen
5dc0a49a9b feat(timeline): enrich per-image multimodal subspan output
The per-image multimodal subspan only captured image_url / enable_ocr /
enable_caption on input and chunk_id on output, so the trace viewer
could not answer "what did THIS image actually produce?" without
joining back to the chunks table.

Adds to the per-image span output:
- vlm_model_id (or "legacy_inline" for inline-config KBs)
- image_bytes (read size)
- ocr_prompt: "default" | "scanned_pdf"
- ocr_chars + ocr_preview (sanitized text, capped at 200 runes)
- caption_chars + caption_preview
- chunks_created (count of OCR/caption child chunks)
- indexed (true after BatchIndex completes)
- per-step error fields (read_error / ocr_error / caption_error /
  skipped reason) when something fails

Also adds parent_chunk_id to the span input so the trace links back to
the text chunk this image hangs off — useful when a doc has hundreds
of inline images and you need to know WHERE in the text this one came
from.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 15:14:45 +08:00
wizardchen
f99bcb79e7 feat(timeline): scale axis to async tail, enrich span payloads, polish UI
The post-process stage closes in ~9ms (just enqueue work) but its async
subspans (postprocess.summary, postprocess.question, postprocess.graph)
keep producing rows for tens of seconds AFTER the root finalizes. The
old timeline used trace.duration_ms as the time axis maximum, which
clipped those subspan bars past the right edge.

Timeline:
- totalMs now always takes max(trace.duration_ms, observed-tail), so
  the axis stretches to fit the latest descendant end regardless of
  parse_status.
- Render a faint dashed wrapping outline behind a parent span when
  its descendants extend past its own finished_at, so the postprocess
  stage row visibly spans the full window without overloading the
  9ms self-time bar.
- Tree expand/collapse caret bumped from 10 to 14px in a 16x16 hit
  area; copy icons in detail panel bumped from 11/14 to 14/18px;
  .kp-kv-copy button grown from 18 to 22px.
- Short input/output payloads (<= 8 entries / <= 600 bytes JSON)
  auto-expand inline so users see the actual data without an extra
  click; longer payloads keep the click-to-expand summary.

Span payloads (subspans only - root keeps the canonical identity, no
duplicate knowledge_id/kb_id/tenant_id on every child):
- extract.go: graph subspan output gains chunk_chars, chunk_preview,
  sample_nodes, sample_relations.
- summary subspan output gains model_id, summary_preview.
- question subspan output gains model_id and a sample_question
  captured from the first non-empty LLM response.

i18n: new key knowledgeStages.detail.includingChildren for the
wrapping-bar tooltip.
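
A rough sketch of the totalMs rule from the Timeline list above. The SpanTiming shape and the computeTotalMs name are hypothetical, timestamps are assumed to be epoch milliseconds, and measuring still-running spans up to "now" is an assumption rather than something the message states.

```typescript
interface SpanTiming {
  startedAt: number; // epoch ms
  finishedAt: number | null; // null while the span is still running
}

// Axis maximum: the root's own duration or the latest descendant end,
// whichever is larger.
function computeTotalMs(
  traceStartedAt: number,
  traceDurationMs: number,
  spans: SpanTiming[],
  now: number,
): number {
  let observedTail = 0;
  for (const span of spans) {
    const end = span.finishedAt ?? now;
    observedTail = Math.max(observedTail, end - traceStartedAt);
  }
  return Math.max(traceDurationMs, observedTail);
}
```

With a root that finalized after 6,200 ms and an async subspan that ended 30,000 ms after the trace started, the axis stretches to 30,000 ms, so the subspan bar is no longer clipped.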

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 15:14:45 +08:00
wizardchen
bd029b6d19 feat(knowledge): instrument postprocess subspans + polish timeline UI
Backend: summary, question, and graph-extract async tasks now record
real processing time as subspans hanging off the (closed) postprocess
stage, so the trace viewer no longer caps the postprocess row at the
~10 ms enqueue duration. Carries Attempt through SummaryGeneration /
QuestionGeneration / ExtractChunk payloads so cross-process workers
can resolve the right parent attempt.

Frontend: drawer now uses attach="body" so the secondary 820px detail
drawer escapes the 654px container of the main drawer; timeline
timestamps include date prefix (MM-DD, or YYYY-MM-DD across years);
"updated X ago" caption only shows during live polling.

Tests: 4 new cases covering postprocess subspan attaching under a
closed parent, missing-parent fallthrough, and Attempt JSON round-trip
on the three task payloads.
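
One way the date-prefix rule for timeline timestamps might look. The function name is hypothetical, the HH:mm:ss time part is an assumption, and "across years" is read here as "the timestamp's year differs from the current year".

```typescript
function pad2(n: number): string {
  return n < 10 ? "0" + n : String(n);
}

// MM-DD prefix within the current year, YYYY-MM-DD otherwise.
// Uses local-time getters, as a browser UI would.
function formatTimelineTimestamp(ts: Date, now: Date): string {
  const monthDay = `${pad2(ts.getMonth() + 1)}-${pad2(ts.getDate())}`;
  const time = `${pad2(ts.getHours())}:${pad2(ts.getMinutes())}:${pad2(ts.getSeconds())}`;
  const prefix =
    ts.getFullYear() === now.getFullYear()
      ? monthDay
      : `${ts.getFullYear()}-${monthDay}`;
  return `${prefix} ${time}`;
}
```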

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 15:14:45 +08:00
wizardchen
5ffc755e76 fix(knowledge): infer synthesized stage status from parse_status
Pre-tracker historical knowledge has zero rows in
knowledge_processing_spans but parse_status correctly reads
"completed" or "failed". The /spans handler was synthesizing five
"pending" placeholders unconditionally, so legacy completed documents
rendered as if they were stuck waiting in the queue forever.

buildSpanTree now takes parse_status and chooses the placeholder
status accordingly:
- ParseStatusCompleted -> done
- ParseStatusFailed    -> failed
- everything else      -> pending (existing behaviour)

Real rows always take precedence; this only changes what we put in
the gaps. So healthy in-flight parses (parse_status=processing,
some real rows, some still pending) keep showing pending placeholders
exactly as before — the synthesized "completed" inference only fires
when the parse already hit terminal state.

Adds TestBuildSpanTree_LegacyCompletedRendersAsDone covering both
the completed-legacy and failed-legacy branches.
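
The placeholder rule can be modeled as below. The real buildSpanTree lives in the Go backend; this is a TypeScript model for illustration only. The lower-case stage names and status strings are assumptions about the wire format, and fillStageGaps is a hypothetical helper.

```typescript
type ParseStatus = "pending" | "processing" | "completed" | "failed";
type SpanStatus = "pending" | "running" | "done" | "failed";

const CANONICAL_STAGES = [
  "docreader", "chunking", "embedding", "multimodal", "postprocess",
];

// Status used for a stage that has no recorded row.
function placeholderStatus(parseStatus: ParseStatus): SpanStatus {
  switch (parseStatus) {
    case "completed": return "done";
    case "failed": return "failed";
    default: return "pending"; // existing behaviour
  }
}

interface StageRow {
  stage: string;
  status: SpanStatus;
  synthesized: boolean;
}

// Real rows always win; placeholders only fill the gaps.
function fillStageGaps(
  realRows: Record<string, SpanStatus>,
  parseStatus: ParseStatus,
): StageRow[] {
  return CANONICAL_STAGES.map((stage): StageRow => {
    const real = realRows[stage];
    return real !== undefined
      ? { stage, status: real, synthesized: false }
      : { stage, status: placeholderStatus(parseStatus), synthesized: true };
  });
}
```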
2026-05-28 15:14:45 +08:00
wizardchen
01d00c2a36 refactor(frontend): rebuild timeline UI with side drawer + attempt tabs
The previous Langfuse-style waterfall had concrete UX problems visible
in the field: the status text column wrapped Chinese characters across
two lines, the root span name truncated mid-identifier, the inline
detail expansion shifted sibling rows out of order, and the attempt
selector was a plain HTML <select>.

This commit rebuilds the visual layer around three corrections:

- Status column removed entirely. Status is conveyed by the 8 px dot
  and the bar color; the redundant Chinese label that was wrapping
  vertically is gone. Header subtitle now reads "总耗时 6.2s · ✓ 已完成"
  ("Total 6.2s · ✓ Completed") via knowledgeStages.total +
  knowledgeStages.status.<value>.
- Inline row-detail replaced by a right-side drawer absolutely
  positioned inside the timeline shell (overflow: hidden). Clicking a
  row no longer reorders later rows beneath it. The drawer slides via
  CSS transform, ESC and × close it, clicking the selected row toggles.
- Attempt <select> dropdown replaced by a horizontal pill tab strip
  with a per-attempt status glyph (✓ / ✗ / ●). attemptStatuses Map
  populates lazily — one /spans?attempt=n fetch per missing tab on
  mount — so multi-attempt history doesn't block the initial render.

Drawer body shows the rich span data the backend just started
emitting: timing (started / finished / duration), an error block when
failed/cancelled, and Input / Output / Metadata as a 2-column
humanized key/value table. Numbers use toLocaleString, booleans render
as ✓/✗, arrays/objects collapse to a summary with a per-key "Show
JSON" toggle. Existing copySpan retains the raw JSON export path.

Root span name is now localized via knowledgeStages.root ("Knowledge
processing" / "知识处理") instead of leaking the raw identifier
"knowledge_processing" — which previously truncated as
"knowledge_process…".

i18n: added knowledgeStages.total and knowledgeStages.detail.{started,
finished, duration, input, output, metadata, error, showJson, hideJson}
to en-US, zh-CN, ko-KR, ru-RU. Compact mode (used by the card hover
popover) is untouched.
2026-05-28 15:14:45 +08:00
wizardchen
4db6e69b4e fix(knowledge): close root span on terminal state, enrich stage metadata
The root span created by OpenAttempt was never closed: PostProcess only
ended the postprocess stage, so the root row stayed at status=running
forever even after parse_status flipped to completed/failed. The
timeline rendered "进行中" ("in progress") indefinitely on the root, so
it could not answer the one question it exists for: is the document done?

- SpanTracker.FinalizeAttempt(kid, attempt, status, output, code, msg):
  closes the root row idempotently. Re-closing a terminal root no-ops
  so success / cascade-fail / dead-letter paths can fire without
  coordination.
- PostProcess.Handle calls FinalizeAttempt(done) after EndSpan(postprocess)
  on the success path. Async downstream work (summary/question/wiki/
  graph) still records its own spans; their completion extends the
  trace's wall-clock end-time but doesn't reopen the root.
- FailSpan auto-closes the root when a MAIN pipeline stage fails
  (docreader / chunking / embedding / multimodal / postprocess).
  Cascade-cancelled siblings stay closed-with-the-cascade as before.
- Dead-letter callback (router/task.go) accepts the SpanTracker via
  DI and calls FinalizeAttempt(failed, TASK_TIMEOUT) when a
  document-related task exhausts retries. The probe payload now
  extracts the Attempt field that Document/Manual/PostProcess
  payloads already thread through.

Stage spans were also being recorded with nil input/output, leaving
the new detail panel with timestamps only. Each Begin/End site now
emits useful work metrics:
- DocReader   input:  file_name, file_type, is_url, url
              output: text_length, images_found, is_audio, pages
- Chunking    input:  chunks_planned
              output: chunks_written, total_text_chars
- Embedding   input:  chunks_to_embed, model_id, dim
              output: vectors_written, storage_bytes
- Multimodal  input:  image_count, enable_ocr, enable_caption
- PostProcess output: chunks_total, enqueued_summary, _question, _wiki, _graph

i18n: add knowledgeStages.root ("Knowledge processing") so the UI
can render a localized name instead of the raw span identifier.
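
The idempotent close described for FinalizeAttempt can be illustrated with a small in-memory model. The actual method is Go and persists to the spans table; the RootSpan shape, the boolean return value and the argument order here are hypothetical.

```typescript
type SpanStatus =
  | "pending" | "running" | "done" | "failed" | "skipped" | "cancelled";

interface RootSpan {
  status: SpanStatus;
  finishedAt: number | null;
  errorCode: string | null;
}

const TERMINAL: SpanStatus[] = ["done", "failed", "skipped", "cancelled"];

function isTerminal(status: SpanStatus): boolean {
  return TERMINAL.indexOf(status) !== -1;
}

// Closes the root once. A second call on a terminal root is a no-op,
// so success, cascade-fail and dead-letter paths need no coordination.
// Returns true only when this call actually closed the root.
function finalizeAttempt(
  root: RootSpan,
  status: SpanStatus,
  now: number,
  errorCode: string | null = null,
): boolean {
  if (isTerminal(root.status)) return false;
  root.status = status;
  root.finishedAt = now;
  root.errorCode = errorCode;
  return true;
}
```

If the success path closes the root first, a later dead-letter call with TASK_TIMEOUT leaves the recorded outcome untouched.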
2026-05-28 15:14:45 +08:00
wizardchen
c5722234ef refactor(frontend): redesign parsing timeline as Langfuse-style waterfall
The five-circle pipeline conveyed only "which stage we're on" — operators
asked for the actual span tree (root → stages → batches/images), the time
each span took relative to the whole run, and the recorded JSON in/out.
This commit rebuilds the visual layer to match Langfuse's trace view
without touching the data layer (polling, attempt selector, retry,
copy-span and watchers all unchanged).

- Three-column grid (name | status+duration | bar): each span is one
  row, indented by depth. Status dot 8px, name monospace, duration
  right-aligned. Bars are absolutely-positioned inside a shared
  timeline lane, left/width computed from each span's
  started_at / duration_ms relative to trace.started_at and the run's
  end (or "now" while running).
- Time ruler with five ticks (0% → 100%) above the bars; total
  duration label on the right.
- Sub-spans (embedding.batch[i], multimodal.image[i]) collapsed
  under each stage by default behind a chevron toggle. Click any
  row to expand an inline detail panel showing ISO timestamps,
  status, error_code/message when failed, and input/output/metadata
  JSON blocks plus a "Copy span JSON" action.
- Removed the parse_status gate in doc-content: the timeline is now
  always rendered when a knowledge detail is open, so completed
  documents also expose their recorded pipeline for audit. The
  KnowledgeBase hover popover stays compact-mode-only and still
  gates on parse_status (popover height matters there).
- Running bars grow visibly via a 1s ticker (cleared on unmount).
- Last-error block restricted to parse_status === failed.
- Added knowledgeStages.totalDuration to en-US / zh-CN / ko-KR / ru-RU.
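
A minimal sketch of the bar placement described in the first bullet, assuming epoch-millisecond timestamps. The barGeometry name, the percentage output and the clamping are illustrative choices, not taken from the component.

```typescript
interface Bar {
  leftPct: number; // offset from the lane's left edge, 0..100
  widthPct: number; // bar width, 0..100
}

function barGeometry(
  spanStartedAt: number,
  spanDurationMs: number | null, // null while the span is running
  traceStartedAt: number,
  traceEndAt: number | null, // null while the run is in progress
  now: number,
): Bar {
  const end = traceEndAt ?? now; // the run's end, or "now" while running
  const total = Math.max(end - traceStartedAt, 1); // guard against divide-by-zero
  const duration = spanDurationMs ?? Math.max(now - spanStartedAt, 0);
  const clamp = (v: number) => Math.min(100, Math.max(0, v));
  const leftPct = clamp(((spanStartedAt - traceStartedAt) / total) * 100);
  const widthPct = clamp(Math.min((duration / total) * 100, 100 - leftPct));
  return { leftPct, widthPct };
}
```

For a finished 1,000 ms run, a span that started at 250 ms and took 500 ms gets left 25% and width 50%; recomputing with a newer "now" on each ticker pass is what would make a running bar grow.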
2026-05-28 15:14:45 +08:00
wizardchen
1d8ac301c2 feat(frontend): visualize document parsing pipeline with span timeline
Renders a five-segment timeline (DocReader → Chunking → Embedding →
Multimodal → PostProcess) above each document so users can see exactly
which stage a document is in and which one failed, instead of a flat
"Parsing..." spinner.

- New KnowledgeProcessingTimeline component (full + compact mode):
  full mode lives at the top of doc-content for any document not yet
  in `completed`; compact mode replaces the parsing/failed lines in
  the card hover popover with a five-dot mini-progress.
- Polls GET /api/v1/knowledge/:id/spans every 2s while
  parse_status in {pending, processing}; stops on terminal status or
  unmount. Hover popover instances opt out of polling to avoid
  amplifying requests on grid hover.
- Failed / cancelled steps expand into an error card with the raw
  error_message, an error_code -> localized title + suggestion mapping
  for the twelve canonical codes (DOCREADER_TIMEOUT, EMBEDDING_RATE_LIMIT,
  ...), and a "Copy details" button that pastes the failed span JSON
  for support tickets.
- Attempt selector renders only when latest_attempt > 1 so reparse
  history is browsable; the latest attempt is selected by default.
- Last-error block under the timeline includes a Retry button that
  calls reparseKnowledge() and immediately re-fetches.
- i18n strings added under knowledgeStages for en-US, zh-CN, ko-KR,
  ru-RU. ja-JP locale file does not exist in this repository.
2026-05-28 15:14:45 +08:00
wizardchen
06f94a4811 feat(migrations): add knowledge_processing_spans table for tracking document parsing progress
- Introduced a new migration (000054) to create the knowledge_processing_spans table, which captures detailed progress of document parsing stages.
- The table includes fields for span hierarchy, status tracking, and metadata to enhance visibility into the parsing pipeline.
- Added indexes to optimize queries related to knowledge_id, attempt, and span relationships.
- Implemented rollback migration to safely drop the table and associated indexes if needed.
2026-05-28 15:14:45 +08:00
wizardchen
c82b098f44 fix(knowledge): address span-tree review feedback
- Collapse migrations 000052 (flat stages) + 000053 (span tree) into a
  single 000052_knowledge_processing_spans migration; the flat stages
  table never escaped this branch and the create-then-drop sequence had
  no value.
- BeginStage: detect an existing (kid, attempt, stage) row before
  inserting and reuse its span_id with reset state, so re-entry from
  asynq retries or adjacent code paths no longer produces duplicate
  timeline segments.
- FailSpan: when sibling-stage cascade flips a dependent stage to
  cancelled, also CancelDescendants on its subtree so already-running
  subspans (embedding.batch[i] etc.) cannot remain as orphan running
  rows under a cancelled parent.
- Dead-letter callback: replace two sequential UpdateKnowledgeColumn
  writes with a single UpdateKnowledgeColumns map update so we cannot
  end up with parse_status=failed and stale error_message (or vice
  versa) when one of the two writes fails.
- touchKnowledgeHeartbeat: skip subspan/generation transitions; only
  root and stage transitions poke knowledge.updated_at. Spans-table
  MAX(updated_at) already covers subspan progress for housekeeping, so
  this avoids 2*N+ UPDATE bursts on the same hot row when a multimodal
  stage fans out to many images.
- Add regression tests for the BeginStage idempotency contract and the
  cascade-into-subspans behaviour.
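
The BeginStage idempotency contract can be modeled with an in-memory array standing in for the spans table. The real implementation is Go against a database; the StageSpan shape and the span-id scheme below are hypothetical.

```typescript
interface StageSpan {
  spanId: string;
  knowledgeId: string;
  attempt: number;
  stage: string;
  status: "running" | "done" | "failed" | "cancelled";
  startedAt: number;
  finishedAt: number | null;
  errorCode: string | null;
}

// Reuse the existing (knowledgeId, attempt, stage) row with reset state
// instead of inserting a duplicate timeline segment.
function beginStage(
  store: StageSpan[],
  knowledgeId: string,
  attempt: number,
  stage: string,
  now: number,
): StageSpan {
  for (const row of store) {
    if (row.knowledgeId === knowledgeId && row.attempt === attempt && row.stage === stage) {
      row.status = "running";
      row.startedAt = now;
      row.finishedAt = null;
      row.errorCode = null;
      return row;
    }
  }
  const created: StageSpan = {
    spanId: `${stage}-${store.length + 1}`, // illustrative id scheme
    knowledgeId, attempt, stage,
    status: "running",
    startedAt: now,
    finishedAt: null,
    errorCode: null,
  };
  store.push(created);
  return created;
}
```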
2026-05-28 15:14:45 +08:00
wizardchen
414749681f test(knowledge): add coverage and protect housekeeping from false-killing long stages
Addresses three review concerns from the prior PR:

  1. No tests existed for any of the stuck-parsing fixes (PR① / PR②.5).
     This commit adds coverage for the four most regression-prone
     surfaces: span repo upsert/cascade/attempt isolation, span
     tracker cascade-cancel and cross-process LookupStage,
     housekeeping false-kill protection, and handler tree assembly.

  2. Housekeeping was using only knowledge.updated_at as its staleness
     signal, but knowledge.updated_at advances only at parse_status
     transitions — a long DocReader call (or large embedding batch)
     can run for an hour with no updated_at change, so a tight
     DocumentProcessTimeout setting would falsely flip an actively
     running parse to "failed".

     The sweep now does a two-stage check: candidates by knowledge
     updated_at, then filtered by MAX(spans.updated_at). Every
     SpanTracker Begin/End/Fail/Skip now also pokes
     knowledge.updated_at as a side-channel heartbeat, so the
     filter sees recent activity even when no parse_status
     transition has fired.

  3. parseHeartbeatTime accepts the timestamp formats both Postgres
     and SQLite emit for an aggregated MAX() column (the SQLite
     driver doesn't auto-cast aggregates to time.Time the way
     Postgres does), so the same code path works in Lite mode.

The new TestHousekeeping_NoFalseKill_ActiveSpan is the regression
test for the user-flagged scenario: a 3-hour-stale knowledge.updated_at
combined with a 2-minute-fresh span row must NOT be killed.
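
The two-stage check from item 2 can be modeled as follows. The production sweep is Go plus SQL; this TypeScript version only illustrates the filtering logic, and findStuck, KnowledgeRow and the map of per-document MAX(spans.updated_at) values are hypothetical stand-ins.

```typescript
interface KnowledgeRow {
  id: string;
  parseStatus: string;
  updatedAt: number; // epoch ms
}

// Stage 1: candidates whose knowledge.updated_at is older than the cutoff.
// Stage 2: drop candidates whose newest span row is still fresh.
function findStuck(
  rows: KnowledgeRow[],
  maxSpanUpdatedAt: Record<string, number>,
  now: number,
  timeoutMs: number,
): string[] {
  const cutoff = now - timeoutMs;
  const stuck: string[] = [];
  for (const k of rows) {
    if (k.parseStatus !== "processing" || k.updatedAt >= cutoff) continue;
    const lastSpan = maxSpanUpdatedAt[k.id];
    if (lastSpan !== undefined && lastSpan >= cutoff) continue; // still active
    stuck.push(k.id);
  }
  return stuck;
}
```

The regression scenario maps directly onto this model: a row whose updatedAt is three hours old but whose newest span is two minutes old is filtered out in stage 2.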
2026-05-28 15:14:45 +08:00
wizardchen
c9941a6688 refactor(knowledge): replace flat stage table with langfuse-style span tree
Addresses review feedback that the PR② design had four shortcomings:

  1. The pipeline is a DAG, not a sequence — Embedding and Multimodal
     are independent of each other, both downstream of Chunking, both
     upstream of PostProcess. The flat (knowledge_id, stage) table
     couldn't represent that, so a Chunking failure left dependents
     stranded as "pending" forever instead of being marked as
     impossible-to-run.

  2. No history across attempts. A reparse erased the previous run's
     status before the new run started, leaving operators with no way
     to investigate "why did this fail twice?".

  3. Stages had only status + duration. Operators want to know how big
     the work was — pages parsed, chunks created, tokens embedded, VLM
     calls made — to distinguish "slow because the file is huge" from
     "slow because the docreader is wedged".

  4. Multimodal fans out N image tasks; Embedding fans out M batches;
     PostProcess fans out into Summary/Question/Wiki/Graph. Each unit
     is interesting on its own (Langfuse already captures this for
     LLM calls). The flat model couldn't express it.

The redesign mirrors Langfuse's trace/span/generation hierarchy:

  * Migration 000053 supersedes 000052: knowledge_processing_spans
    table with (knowledge_id, attempt, span_id) primary key, plus
    parent_span_id, kind ∈ {root, stage, subspan, generation},
    status ∈ {pending,running,done,failed,skipped,cancelled}, and
    JSONB input/output/metadata fields.

  * SpanTracker (replacing StageTracker) exposes OpenAttempt /
    BeginStage / BeginSubSpan / EndSpan / FailSpan / SkipSpan /
    LookupStage. Cross-process workers (image_multimodal) get the
    parent's attempt + span via payload + LookupStage so subspans
    attach correctly.

  * StageDependencies declares the DAG; FailSpan now cascades —
    descendants of the failed span and dependent stages are flipped
    to "cancelled" with a UPSTREAM_FAILED code. The UI sees a clear
    blast radius instead of orphan spinners.

  * Reparse now calls OpenAttempt up front so the timeline reflects
    "new attempt, all pending" instead of letting the previous run's
    status linger until the worker picks up the task.

  * Image_multimodal records each image as a generation subspan with
    its own success/failure on the parent attempt's multimodal stage.
    The finalize-on-last-attempt counter logic is preserved unchanged.

  * GET /api/v1/knowledge/:id/spans (also kept /stages alias) returns
    a tree shape with synthesized pending placeholders so the
    frontend always renders five timeline segments. ?attempt=N
    enables history navigation.
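
The stage-level cascade can be sketched over a small dependency table. The edges for embedding, multimodal and postprocess follow item 1 above; the chunking-on-docreader edge is an assumption from the pipeline order, and the function name is hypothetical. The real StageDependencies is Go, and FailSpan also cancels descendants of the failed span, which this model leaves out.

```typescript
// stage -> stages it depends on
const STAGE_DEPENDENCIES: Record<string, string[]> = {
  docreader: [],
  chunking: ["docreader"], // assumed from pipeline order
  embedding: ["chunking"],
  multimodal: ["chunking"],
  postprocess: ["embedding", "multimodal"],
};

// Returns every stage that can no longer run once failedStage has
// failed, i.e. the transitive dependents to flip to "cancelled"
// (the commit uses the UPSTREAM_FAILED code for these).
function stagesToCancel(failedStage: string): string[] {
  const cancelled: string[] = [];
  let changed = true;
  while (changed) {
    changed = false;
    for (const stage of Object.keys(STAGE_DEPENDENCIES)) {
      if (stage === failedStage || cancelled.indexOf(stage) !== -1) continue;
      const blocked = STAGE_DEPENDENCIES[stage].some(
        (dep) => dep === failedStage || cancelled.indexOf(dep) !== -1,
      );
      if (blocked) {
        cancelled.push(stage);
        changed = true;
      }
    }
  }
  return cancelled;
}
```

A chunking failure cancels embedding, multimodal and postprocess, while an embedding failure cancels only postprocess and leaves multimodal free to finish.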
2026-05-28 15:14:45 +08:00
wizardchen
04f56f9cda feat(knowledge): track per-stage parsing progress with /stages API
Adds a five-segment progress model for the document parsing pipeline so
the UI (PR③) can render a timeline showing where each document is
(DocReader → Chunking → Embedding → Multimodal → PostProcess) and
which stage failed with what error code.

- New table `knowledge_processing_stages` (migration 000052) with one
  row per (knowledge_id, stage). UPSERT on Begin/Done/Fail bumps an
  attempt counter so re-parses don't lose history.

- StageTracker service exposes Begin/Done/Fail/Skip; all calls are
  best-effort and never break the pipeline if persistence fails.

- Stable error codes (DOCREADER_TIMEOUT / EMBEDDING_RATE_LIMIT /
  VECTORSTORE_WRITE_FAILED / ...) the UI can map to localized
  remediation hints.

- Tracker call sites added at the five meaningful failure points:
  convert (DocReader), CreateChunks (Chunking), BatchIndex (Embedding),
  enqueueImageMultimodalTasks (Multimodal start),
  KnowledgePostProcess.Handle (Multimodal close + PostProcess).

- New endpoint `GET /api/v1/knowledge/:id/stages` returns the five
  canonical stages — missing rows are synthesized as "pending" so
  the timeline always renders five segments. Includes current_stage
  and last_error block.
2026-05-28 15:14:45 +08:00
wizardchen
3ae3ea97c5 fix(knowledge): prevent documents from getting stuck in "processing"
Several failure modes left Knowledge.parse_status pinned at "processing"
forever, with no signal to users beyond a permanent spinner. This commit
addresses the root causes and adds a safety net.

- Asynq worker pool: explicit Concurrency (default 16, env-tunable via
  WEKNORA_ASYNQ_CONCURRENCY) so batch uploads don't queue behind a
  CPU-count-sized worker pool. Redis op timeouts raised to 500ms/1000ms
  (WEKNORA_REDIS_OP_TIMEOUT_MS) to absorb bursty multimodal counter ops.

- DocReader RPC: cap each call with WEKNORA_DOCREADER_CALL_TIMEOUT
  (default 30m). Without this, a hung docreader pinned a worker for the
  full DocumentProcessTimeout window.

- ImageMultimodal: finalize-on-last-attempt semantics. A permanently
  failing single image no longer strands the parent — the asynq retry
  is allowed to run, but on the final attempt we count the image
  regardless of outcome. Redis DECR errors fall back to enqueuing the
  post-process task instead of returning silently.

- Dead-letter callback: when DocumentProcess / KnowledgePostProcess /
  ManualProcess exhausts retries, immediately mark the corresponding
  Knowledge as failed with the last error. This surfaces the failure
  in the UI without waiting for the housekeeping sweep.

- HousekeepingService: 5-minute cron that flips knowledge rows stuck
  in "processing" past DocumentProcessTimeout + 10m to failed, plus
  summary rows stuck > 1h. Catches anything the other safety nets
  miss (worker SIGKILL mid-handler, etc.). Disable with
  WEKNORA_HOUSEKEEPING_ENABLED=false.

- Distributed startup recovery: previously the post-restart sweep was
  skipped whenever REDIS_ADDR was set, even though Asynq does not
  reschedule the task that was actively running on the dead instance.
  Now the sweep runs in distributed mode too, but only against rows
  older than 30 minutes to avoid racing peer instances.
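
The finalize-on-last-attempt rule from the ImageMultimodal bullet might be modeled like this. The worker itself is Go on Asynq with a Redis counter; settleImageAttempt, its parameters and the in-memory "remaining" counter are hypothetical stand-ins, and the Redis DECR error fallback is not modeled.

```typescript
interface ImageOutcome {
  counted: boolean; // did this attempt decrement the pending-image counter?
  remaining: number; // images still outstanding for the parent document
  enqueuePostProcess: boolean;
}

// An image is counted when it succeeds OR when this was its final retry,
// so one permanently failing image cannot strand the parent.
function settleImageAttempt(
  remaining: number,
  succeeded: boolean,
  retryCount: number,
  maxRetry: number,
): ImageOutcome {
  const finalAttempt = retryCount >= maxRetry;
  if (!succeeded && !finalAttempt) {
    // Let the queue retry; the parent keeps waiting for this image.
    return { counted: false, remaining, enqueuePostProcess: false };
  }
  const next = remaining - 1;
  return { counted: true, remaining: next, enqueuePostProcess: next <= 0 };
}
```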
2026-05-28 15:14:45 +08:00
wizardchen
ae4ec0cf06 fix(im): make presigned URL flow diagnosable end-to-end
When IM image rendering breaks, operators previously had no log line to
correlate against the IM platform's fetch attempt. Add observability
hooks on both ends and unblock HEAD probes so common IM previews work
at all.

- log rewriteStorageURLs success/failure/no-op with the full signed URL
  (operators can copy it from logs and verify public reachability)
- log presigned handler 4xx with client_ip + UA + tenant_id + file_path
  so failures correlate against IM platform fetch logs; use the request
  context so trace IDs are preserved
- accept HEAD on /api/v1/files/presigned: IM platforms (Feishu, Slack,
  etc.) probe with HEAD before GET when rendering image previews, and a
  401 there is enough to break the inline image even when the GET would
  have succeeded
- add Admin-only GET /api/v1/files/presigned-preview that returns the
  exact URL an IM channel would embed for the calling tenant, for
  self-service verification without sending a real IM message
- clarify APP_EXTERNAL_URL and MINIO_ENDPOINT public-reachability rules
  in .env.example; misconfigured endpoints are the most common cause of
  "image broken in IM" reports
2026-05-28 08:03:57 +08:00