Supported File Formats
| Format | Extensions | Parser | Fallback |
|---|---|---|---|
| PDF | .pdf | PyPDFLoader | UnstructuredFileLoader |
| Word | .doc, .docx | UnstructuredWordDocumentLoader | UnstructuredFileLoader |
| Excel | .xls, .xlsx | UnstructuredExcelLoader (elements mode) | UnstructuredFileLoader |
| PowerPoint | .ppt, .pptx | UnstructuredPowerPointLoader | UnstructuredFileLoader |
| Markdown | .md, .markdown | UnstructuredMarkdownLoader | Plain text reading |
| CSV | .csv | CSVLoader | Plain text reading |
| Plain Text | .txt | Direct reading | — |
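The table above can be read as a lookup from file extension to a primary parser and a fallback. The following is a minimal sketch of that lookup, not the project's actual code: the names `LOADERS` and `select_loader` are hypothetical, the loaders are represented as plain strings so the example runs without LangChain installed, and both the case-insensitive matching and the error raised for unknown extensions are assumptions.

```python
from pathlib import Path
from typing import Optional, Tuple

# Extension -> (primary parser, fallback), transcribed from the table above.
# "plain_text" stands for direct/plain text reading.
LOADERS = {
    ".pdf": ("PyPDFLoader", "UnstructuredFileLoader"),
    ".doc": ("UnstructuredWordDocumentLoader", "UnstructuredFileLoader"),
    ".docx": ("UnstructuredWordDocumentLoader", "UnstructuredFileLoader"),
    ".xls": ("UnstructuredExcelLoader", "UnstructuredFileLoader"),
    ".xlsx": ("UnstructuredExcelLoader", "UnstructuredFileLoader"),
    ".ppt": ("UnstructuredPowerPointLoader", "UnstructuredFileLoader"),
    ".pptx": ("UnstructuredPowerPointLoader", "UnstructuredFileLoader"),
    ".md": ("UnstructuredMarkdownLoader", "plain_text"),
    ".markdown": ("UnstructuredMarkdownLoader", "plain_text"),
    ".csv": ("CSVLoader", "plain_text"),
    ".txt": ("plain_text", None),
}


def select_loader(filename: str) -> Tuple[str, Optional[str]]:
    """Return (parser, fallback) for a filename; raise on unsupported types."""
    ext = Path(filename).suffix.lower()  # assumed: matching ignores case
    if ext not in LOADERS:
        raise ValueError(f"Unsupported file type: {ext or '(none)'}")
    return LOADERS[ext]
```

For example, `select_loader("report.PDF")` yields the `PyPDFLoader` / `UnstructuredFileLoader` pair, and a caller would try the fallback only if the primary parser raises.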
Processing Pipeline
Detailed Steps
- Upload file: Saved locally to `data/documents/{user_id}/{kb_id}/{uuid}_{filename}`, with a unique filename to prevent conflicts
- Create record: Document metadata (fileName, filePath, fileType, fileSize) is created via Next.js API
- Trigger processing: Call `POST /{kb_id}/documents/{doc_id}/process`, executed asynchronously via `BackgroundTasks`
- Load document: The appropriate LangChain Loader is selected based on the file extension
- Intelligent chunking: SmartChunker is used preferentially, falling back to RecursiveCharacterTextSplitter
- Generate vectors: Uses the user-configured embedding model (`create_embeddings(user_id)`)
- Store in pgvector: Each chunk is stored with metadata: `user_id`, `knowledge_base_id`, `document_id`, `chunk_index`, `source`, `page`, `created_at`
- Invalidate BM25: Notifies BM25Store that the index for this knowledge base needs to be rebuilt (auto-built on next retrieval)
- Update status: Document status is updated to `completed` via Next.js API, with `chunkCount` recorded
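Two of the steps above, the upload path and the per-chunk metadata, are concrete enough to sketch. The example below is illustrative only: `build_storage_path` and `build_chunk_metadata` are hypothetical helper names, the hex UUID format and the UTC ISO timestamp are assumptions, and stripping directory components from the filename is a precaution added for this sketch rather than documented behavior.

```python
import uuid
from datetime import datetime, timezone
from pathlib import PurePosixPath


def build_storage_path(user_id: str, kb_id: str, filename: str) -> str:
    """Build data/documents/{user_id}/{kb_id}/{uuid}_{filename}."""
    # Keep only the final path component so a crafted filename cannot
    # escape the knowledge-base directory (an assumption of this sketch).
    safe_name = PurePosixPath(filename.replace("\\", "/")).name
    unique_name = f"{uuid.uuid4().hex}_{safe_name}"
    return str(PurePosixPath("data/documents") / user_id / kb_id / unique_name)


def build_chunk_metadata(user_id, kb_id, document_id, chunk_index, source, page=None):
    """Assemble the per-chunk metadata fields listed in the storage step."""
    return {
        "user_id": user_id,
        "knowledge_base_id": kb_id,
        "document_id": document_id,
        "chunk_index": chunk_index,
        "source": source,
        "page": page,  # None for formats without page numbers (assumed)
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
```

Because the UUID prefix is random, uploading the same filename twice produces two distinct paths, which is what prevents conflicts.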
Document Status
| Status | Description |
|---|---|
| `pending` | Uploaded, processing not yet started |
| `processing` | Currently being processed (async background task) |
| `completed` | Processing complete |
| `failed` | Processing failed (`errorMessage` recorded) |
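The four statuses can be modeled as a small state machine. This is a sketch under assumptions: the `DocumentStatus` enum and `can_transition` helper are hypothetical names, and the transition map is inferred from the order of the table. In particular, the source does not say whether a `failed` or `completed` document can be reprocessed, so both are treated as terminal here.

```python
from enum import Enum


class DocumentStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumed transitions; retrying a failed document is not modeled.
ALLOWED = {
    DocumentStatus.PENDING: {DocumentStatus.PROCESSING},
    DocumentStatus.PROCESSING: {DocumentStatus.COMPLETED, DocumentStatus.FAILED},
    DocumentStatus.COMPLETED: set(),
    DocumentStatus.FAILED: set(),
}


def can_transition(current: str, target: str) -> bool:
    """Return True if moving from `current` to `target` is allowed."""
    return DocumentStatus(target) in ALLOWED[DocumentStatus(current)]
```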
SmartChunker Intelligent Chunking
SmartChunker draws inspiration from RAGFlow's DeepDoc design, selecting the optimal chunking strategy based on document type to achieve semantically aware chunking rather than fixed-size splitting.
Chunking Strategies
| Document Type | Strategy | Description |
|---|---|---|
| Markdown | MarkdownHeaderTextSplitter | Chunks by heading level (H1–H6), preserving hierarchy metadata |
| Code files | Language-aware splitter | Chunks by function/class syntax structure, supports 15+ languages |
| Tables | Table integrity preservation | Tables are kept as independent chunks to avoid being split |
| PDF / Other | RecursiveCharacterTextSplitter | Recursive character chunking + paragraph awareness |
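To illustrate what chunking by heading level with hierarchy metadata means for the Markdown row, here is a simplified, dependency-free stand-in. It is not SmartChunker and not LangChain's MarkdownHeaderTextSplitter; the function name `split_markdown_by_headings` and the `h1`..`h6` metadata keys are assumptions made for this sketch.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # code-fence marker, built this way to keep the example readable


def split_markdown_by_headings(text: str) -> list:
    """Split Markdown into chunks at H1-H6 headings, tracking the heading path."""
    chunks = []
    path = {}      # heading level -> current heading title
    buffer = []    # body lines collected under the current heading path
    in_fence = False

    def flush():
        content = "\n".join(buffer).strip()
        if content:
            metadata = {f"h{level}": title for level, title in sorted(path.items())}
            chunks.append({"content": content, "metadata": metadata})
        buffer.clear()

    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # '#' inside a code fence is not a heading
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new heading replaces its own level and drops deeper levels.
            for stale in [lv for lv in path if lv >= level]:
                del path[stale]
            path[level] = match.group(2)
        else:
            buffer.append(line)
    flush()
    return chunks
```

Each chunk carries the headings above it, so text under `## Install` inside `# Guide` is stored with both titles and can be traced back to its section at retrieval time.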
Supported Code Languages
Python, JavaScript/TypeScript, Java, Go, C/C++, C#, Ruby, PHP, Rust, Scala, Swift, Kotlin, and more.
Processing Configuration
Processing configuration can be customized at the knowledge base level, retrieved via `GET /api/knowledge-base/{id}/config`:
| Parameter | Default | Description |
|---|---|---|
| `chunkSize` | `1000` | Maximum characters per chunk |
| `chunkOverlap` | `200` | Overlap characters between chunks |
| `separators` | `["\n\n", "\n", "。", ".", "!", "?", ";", ";", " ", ""]` | Recursive separators |
| `indexMethod` | `high_quality` | Indexing method |
| `replaceConsecutiveSpaces` | `true` | Replace consecutive spaces |
| `deleteUrlsAndEmails` | `false` | Remove URLs and email addresses |
| `useSmartChunking` | `true` | Whether to enable SmartChunker (disabled falls back to traditional chunking) |
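The two text-cleaning switches in the table can be sketched as a preprocessing pass that runs before chunking. This is an illustration under assumptions: `ProcessingConfig` and `preprocess` are hypothetical names, the URL and email patterns are deliberately simple, `separators` is omitted for brevity, and collapsing only spaces and tabs (not newlines) is a choice made here so that newline-based separators still work.

```python
import re
from dataclasses import dataclass


@dataclass
class ProcessingConfig:
    # Defaults copied from the table above (separators omitted in this sketch).
    chunkSize: int = 1000
    chunkOverlap: int = 200
    indexMethod: str = "high_quality"
    replaceConsecutiveSpaces: bool = True
    deleteUrlsAndEmails: bool = False
    useSmartChunking: bool = True


# Simplified patterns; real-world URL and email matching is more involved.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def preprocess(text: str, config: ProcessingConfig) -> str:
    """Apply the two text-cleaning switches before chunking."""
    if config.deleteUrlsAndEmails:
        text = URL_RE.sub("", text)
        text = EMAIL_RE.sub("", text)
    if config.replaceConsecutiveSpaces:
        # Collapse runs of spaces and tabs only; newlines are preserved.
        text = re.sub(r"[ \t]+", " ", text)
    return text
```

Removing URLs first and collapsing spaces afterwards means the gaps left by deleted URLs and addresses are also cleaned up.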
Chunk Preview
A client-side preview endpoint is available for real-time visualization when configuring chunking parameters. It returns the `index`, `content`, and `length` of each chunk.
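The shape of a preview result can be sketched with a plain sliding-window split. This is a stand-in for illustration, not the endpoint's implementation: `preview_chunks` is a hypothetical name, and fixed-size windows are used here in place of SmartChunker or the recursive splitter. Only the `index`, `content`, and `length` fields come from the description above.

```python
def preview_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list:
    """Return [{index, content, length}, ...] for a sliding-window split."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        content = text[start:start + chunk_size]
        chunks.append({"index": len(chunks), "content": content, "length": len(content)})
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

Rerunning the function with different `chunk_size` and `chunk_overlap` values shows how the chunk count and boundaries shift, which is the kind of feedback a preview is meant to give.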