- Moved the _SEPARATOR_CELL regex definition to a more appropriate location in the Markdown parser.
- Implemented a fallback mechanism in the gRPC document reader to handle cases where the ReadStream RPC is unimplemented, ensuring compatibility with older versions.
- Added a readUnary method to maintain backward compatibility with the legacy unary Read RPC.
- Improved cancellation handling in the MinerUCloud and PaddleOCR-VL readers to prevent excessive API calls during context cancellation.
Introduce opendataloader and PaddleOCR-VL parser engines with tenant-level
settings UI, replace liteparse, and harden Excel/PPT/Markdown parsing.
Optional odl-hybrid sidecar stays local-build only and is excluded from
default dev-start and full profiles.
Large scanned PDFs (hundreds of pages) were slow and fragile end-to-end.
This change addresses the parse, transport, and task-scheduling layers:
docreader (parse + transport):
- Parallelize per-page scanned rendering across processes (forkserver/fork),
with serial fallback. ~4-7x faster on large scanned PDFs; pdfium is not
thread-safe so we fan out across processes. Configurable via
DOCREADER_PDF_RENDER_PARALLELISM.
- Add server-streaming ReadStream RPC: emit one meta frame then one frame per
image, so documents with many page images are no longer capped by the unary
gRPC message-size limit (an 874-page PDF produced ~193MiB of images, far over
the 50MB cap) and memory is bounded on both ends. Unary Read is kept for
backward compatibility; the Go production reader switches to ReadStream.
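
The framing described above (one meta frame, then one frame per image) can be sketched in Python as follows. The `Frame` shape and the `to_frames` / `from_frames` helpers are hypothetical and are not the real protobuf messages or servicer code; they only illustrate why no single message has to carry every page image.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Optional, Tuple


@dataclass
class Frame:
    """One message on the stream (hypothetical shape, not the real proto)."""
    kind: str                         # "meta" or "image"
    markdown: Optional[str] = None    # set on the meta frame only
    image_count: int = 0              # set on the meta frame only
    image_name: Optional[str] = None  # set on image frames only
    image_bytes: bytes = b""          # set on image frames only


def to_frames(markdown: str, images: Dict[str, bytes]) -> Iterator[Frame]:
    """Yield one meta frame, then one frame per image."""
    yield Frame(kind="meta", markdown=markdown, image_count=len(images))
    for name, data in images.items():
        yield Frame(kind="image", image_name=name, image_bytes=data)


def from_frames(frames: Iterable[Frame]) -> Tuple[str, Dict[str, bytes]]:
    """Reassemble a document on the receiving side.

    Images are collected in a dict here only to keep the sketch short; a
    receiver that wants bounded memory would persist each image as it arrives.
    """
    it = iter(frames)
    meta = next(it)
    if meta.kind != "meta":
        raise ValueError("first frame must be the meta frame")
    images = {f.image_name: f.image_bytes for f in it}
    if len(images) != meta.image_count:
        raise ValueError("stream ended before all images arrived")
    return meta.markdown, images
```

With this shape the largest message is bounded by the largest single image rather than by the sum of all images.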
VLM:
- Make the VLM HTTP timeout configurable (VLM_HTTP_TIMEOUT_SECONDS) and raise
the default 90s -> 180s so dense scanned-page OCR does not time out with
"context deadline exceeded".
Async task queues:
- Isolate high-volume, model-heavy fan-out tasks into dedicated asynq queues so
a single large document cannot saturate the shared worker pool and block
user-facing document parsing:
    image:multimodal    -> "multimodal"
    chunk:extract       -> "graph"
    question:generation -> "question"
- Register the new queues in the server weight map and the cancel inspector's
scanned-queue set (so cancelling a knowledge still purges its pending tasks).
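
The routing above amounts to a lookup from task type to queue name. The real code is Go and uses asynq; the Python sketch below only illustrates the mapping. The `"default"` fallback queue, the `queue_for` helper, and the `SCANNED_QUEUES` name are assumptions made for illustration.

```python
# Task type -> dedicated queue, as listed in the change description.
TASK_QUEUES = {
    "image:multimodal": "multimodal",
    "chunk:extract": "graph",
    "question:generation": "question",
}

# Assumption: every other task type stays on a shared queue named "default".
DEFAULT_QUEUE = "default"


def queue_for(task_type: str) -> str:
    """Return the queue a task of the given type should be enqueued on."""
    return TASK_QUEUES.get(task_type, DEFAULT_QUEUE)


# Queues a cancel operation would have to scan so that pending tasks on the
# dedicated queues are purged too (hypothetical name).
SCANNED_QUEUES = {DEFAULT_QUEUE, *TASK_QUEUES.values()}
```

Deriving the scanned set from the same mapping is one way to keep the two lists from drifting apart when a queue is added later.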
- Implemented a retry mechanism for DOC to DOCX conversion to handle concurrent `soffice` invocations, ensuring each attempt uses a dedicated user profile directory.
- Added logging for each conversion attempt, including success and failure messages, to improve visibility into the conversion process.
- Adjusted the handling of temporary directories for both conversion output and user profiles, enhancing robustness against conversion failures.
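
A minimal sketch of the retry idea, assuming a helper named `convert_doc_to_docx` (hypothetical) and three attempts (also an assumption). Each attempt passes its own `-env:UserInstallation` profile directory to `soffice`, so concurrent invocations do not share a profile. The `run` parameter exists only so the sketch can be exercised without LibreOffice installed.

```python
import logging
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Optional, Sequence

logger = logging.getLogger(__name__)


def convert_doc_to_docx(
    src: Path,
    out_dir: Path,
    attempts: int = 3,  # assumption: the real retry count may differ
    run: Callable[[Sequence[str]], int] = subprocess.call,
) -> Optional[Path]:
    """Convert src to DOCX, retrying with a fresh soffice profile each time."""
    expected = out_dir / (src.stem + ".docx")
    for attempt in range(1, attempts + 1):
        # A dedicated, throwaway user profile directory per attempt.
        with tempfile.TemporaryDirectory(prefix="soffice-profile-") as profile:
            cmd = [
                "soffice",
                "--headless",
                f"-env:UserInstallation={Path(profile).as_uri()}",
                "--convert-to", "docx",
                "--outdir", str(out_dir),
                str(src),
            ]
            code = run(cmd)
        if code == 0 and expected.exists():
            logger.info("DOC->DOCX attempt %d/%d succeeded", attempt, attempts)
            return expected
        logger.warning(
            "DOC->DOCX attempt %d/%d failed (exit code %s)", attempt, attempts, code
        )
    return None
```

Checking that the output file exists, in addition to the exit code, guards against a run that exits cleanly without writing anything.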
- Added a regex pattern for image file extensions to the utils module for better image detection.
- Updated the BODY_XPATH in the xpaths module to prioritize matching specific content structures in web pages.
- These changes aim to improve the accuracy and efficiency of content extraction from web pages using the StdWebParser class.
- Bump base image in docker/Dockerfile.app from golang:1.24 to golang:1.26
to match `go 1.26` declared in go.mod (fixes CI build failure on
`go mod download`).
- Drop unused docreader components and their dependencies:
  - Remove `docreader/ocr/` package (paddle/vlm/dummy backends are
    unreferenced by the main flow; OCR/VLM is handled by the Go App).
  - Remove `docreader/parser/storage.py` (dead code; image persistence
    happens in the Go App via inline ImageRef bytes).
  - Remove `docreader/scripts/download_deps.py` (PaddleOCR pre-download).
  - Drop deps: paddleocr, paddlepaddle, openai, ollama, minio,
    cos-python-sdk-v5, oss2, asyncio, pypdf2, markdown, mistletoe,
    goose3, markdownify, pdfplumber, antiword, urllib3.
- Re-lock uv.lock: 145 -> 79 packages.
- Update docreader/README.md to reflect that OCR/VLM/storage are no
longer configured at the docreader level.
The default DOCREADER_DOCX_MAX_PAGES=100 silently truncates large
documents, causing users to see at most ~1000 chunks regardless of
document length. Change the default to 0 (no limit) so all pages are
processed. Operators who need a cap can still set the env var.
Fixes #719
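
The "0 means no limit" convention can be sketched as below. The helper names `docx_max_pages` and `limit_pages` are hypothetical, and treating invalid or negative values as "no limit" is an assumption, not something the change description states.

```python
import os
from typing import List, Mapping


def docx_max_pages(env: Mapping[str, str] = os.environ) -> int:
    """Read DOCREADER_DOCX_MAX_PAGES; 0 (the new default) means no limit."""
    raw = env.get("DOCREADER_DOCX_MAX_PAGES", "0")
    try:
        value = int(raw)
    except ValueError:
        return 0  # assumption: an unparsable value falls back to "no limit"
    return max(value, 0)


def limit_pages(pages: List[str], max_pages: int) -> List[str]:
    """Apply the cap only when it is positive."""
    return pages if max_pages <= 0 else pages[:max_pages]
```

An operator who still wants the previous behaviour would set `DOCREADER_DOCX_MAX_PAGES=100` explicitly.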
- Introduced PDFScannedParser as a fallback parser for scanned PDFs that converts pages into images for OCR processing.
- Updated PDFParser to include PDFScannedParser in the parsing chain, enhancing the document parsing capabilities for scanned content.
- Improved logging for better error tracking during PDF parsing operations.
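
A parsing chain with a last-resort fallback might look like the sketch below. `parse_with_fallback` and the callables passed to it are hypothetical stand-ins, not the actual PDFParser or PDFScannedParser classes, and "an empty result means try the next parser" is an assumption about how the chain decides to fall through.

```python
import logging
from typing import Callable, Optional, Sequence, Tuple

logger = logging.getLogger(__name__)

# A parser takes raw bytes and returns extracted text, or None/"" if it found nothing.
ParserFn = Callable[[bytes], Optional[str]]


def parse_with_fallback(
    content: bytes, parsers: Sequence[Tuple[str, ParserFn]]
) -> Optional[str]:
    """Try each parser in order and return the first non-empty result."""
    for name, parse in parsers:
        try:
            result = parse(content)
        except Exception:
            # Log with traceback and move on to the next parser in the chain.
            logger.exception("parser %s failed", name)
            continue
        if result and result.strip():
            return result
        logger.info("parser %s produced no text, trying next", name)
    return None
```

Placing the scanned-page parser last means it only runs when the text-based parsers come back empty, which is the typical sign of a scanned PDF.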
- Updated the question generation template to clarify the role of surrounding context and main content.
- Enhanced quality rules for generated questions to better align with user search intent.
- Revised output format and added explicit instructions on what not to generate.
- Improved logging and output in the web parser for better visibility of parsed content and metadata.
- Updated DocReaderServicer to pass metadata in responses.
- Modified PipelineParser to accumulate and merge metadata from all parsers.
- Enhanced StdWebParser to extract and log the title from web page content.
- Implemented logic in knowledge service to update knowledge title based on extracted metadata.
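
The accumulate-and-merge step can be sketched as follows. The merge policy (later non-empty values win) and the `resolve_title` rule are assumptions; the change description does not state how conflicts are resolved or when the knowledge title is overwritten.

```python
from typing import Dict, Iterable, Mapping


def merge_metadata(per_parser: Iterable[Mapping[str, str]]) -> Dict[str, str]:
    """Merge metadata dicts from all parsers, in pipeline order."""
    merged: Dict[str, str] = {}
    for metadata in per_parser:
        for key, value in metadata.items():
            if value:  # skip empty values so they cannot erase earlier ones
                merged[key] = value
    return merged


def resolve_title(current_title: str, metadata: Mapping[str, str]) -> str:
    """Prefer an extracted title; otherwise keep the current one."""
    extracted = metadata.get("title", "").strip()
    return extracted or current_title
```

For a web page, this would let a knowledge entry created from a bare URL pick up the page's own title once parsing finishes.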
- Updated regex patterns in MarkdownImageUtil to support alt text containing brackets and handle MIME types with hyphens.
- Implemented new functions in ImageResolver for resolving HTML <img> tags with data URIs and bare base64 content, improving image handling in markdown.
- Added comprehensive tests for various image scenarios, ensuring robust handling of data URIs and base64 images.
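
To illustrate the two regex cases mentioned above (alt text containing brackets, MIME types with hyphens), here is a hedged sketch. The pattern and the `extract_data_uri_images` helper are hypothetical and are not the patterns used in MarkdownImageUtil; the alt-text part allows only one level of nested brackets.

```python
import base64
import re
from typing import List, Tuple

DATA_URI_IMAGE = re.compile(
    # Alt text: plain characters or one level of nested [...]
    r"!\[(?P<alt>(?:[^\[\]]|\[[^\[\]]*\])*)\]"
    # data URI: MIME subtype may contain '-', '+', and '.'
    r"\(data:(?P<mime>image/[\w.+-]+);base64,(?P<data>[A-Za-z0-9+/=]+)\)"
)


def extract_data_uri_images(markdown: str) -> List[Tuple[str, str, bytes]]:
    """Return (alt, mime, decoded bytes) for each inline data-URI image."""
    return [
        (m["alt"], m["mime"], base64.b64decode(m["data"]))
        for m in DATA_URI_IMAGE.finditer(markdown)
    ]
```

For example, `![chart [v2]](data:image/svg+xml;base64,PHN2Zy8+)` yields the alt text `chart [v2]` and the MIME type `image/svg+xml`.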
- Added a new `.env.lite.example` file for the Lite version, providing a minimal configuration template.
- Updated `.env.example` to remove deprecated variables and include new Docreader settings.
- Enhanced Docker configurations to support the Lite version, including a new Dockerfile for the Docreader service.
- Introduced a Makefile target for building and running the Lite version, along with packaging capabilities.
- Created GitHub workflows for building and releasing Lite binaries, including Homebrew formula support.
- Implemented a new service file for managing the Lite version as a system service.
This update enables a streamlined, single-binary deployment of WeKnora, reducing external dependencies and simplifying setup.
- Add docstrings and inline comments for key functions and complex logic
- Unify comment style, eliminate magic numbers and ambiguous variable names
- No functional changes, only improve maintainability