DocumentService handles the complete document processing workflow: from upload to parsing, chunking, vectorization, and storage.

Supported File Formats

| Format | Extensions | Parser | Fallback |
| --- | --- | --- | --- |
| PDF | .pdf | PyPDFLoader | UnstructuredFileLoader |
| Word | .doc, .docx | UnstructuredWordDocumentLoader | UnstructuredFileLoader |
| Excel | .xls, .xlsx | UnstructuredExcelLoader (elements mode) | UnstructuredFileLoader |
| PowerPoint | .ppt, .pptx | UnstructuredPowerPointLoader | UnstructuredFileLoader |
| Markdown | .md, .markdown | UnstructuredMarkdownLoader | Plain text reading |
| CSV | .csv | CSVLoader | Plain text reading |
| Plain Text | .txt | Direct reading | |
Each format has a fallback strategy — if the specialized Loader fails or its dependencies are not installed, the system automatically falls back to a general-purpose Loader. Excel documents are loaded in elements mode with table content tagged, and PowerPoint slides are tagged by slide number.
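
As a rough sketch of how extension-based selection with a fallback could look (the loader classes are the real LangChain ones from the table above; the mapping and function name are illustrative assumptions, not the service's actual code):

```python
from pathlib import Path

from langchain_community.document_loaders import (
    PyPDFLoader,
    UnstructuredFileLoader,
    UnstructuredWordDocumentLoader,
)

# Illustrative mapping; the actual service registers one loader per supported format.
PRIMARY_LOADERS = {
    ".pdf": PyPDFLoader,
    ".doc": UnstructuredWordDocumentLoader,
    ".docx": UnstructuredWordDocumentLoader,
}

def load_with_fallback(file_path: str):
    """Try the format-specific loader first, then fall back to UnstructuredFileLoader."""
    suffix = Path(file_path).suffix.lower()
    loader_cls = PRIMARY_LOADERS.get(suffix, UnstructuredFileLoader)
    try:
        return loader_cls(file_path).load()
    except Exception:
        # Missing optional dependency or a parse error: use the general-purpose loader.
        return UnstructuredFileLoader(file_path).load()
```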

Processing Pipeline

Detailed Steps

  1. Upload file: Saved locally to data/documents/{user_id}/{kb_id}/{uuid}_{filename}, with a unique filename to prevent conflicts
  2. Create record: Document metadata (fileName, filePath, fileType, fileSize) is created via the Next.js API
  3. Trigger processing: Call POST /{kb_id}/documents/{doc_id}/process, executed asynchronously via BackgroundTasks
  4. Load document: The appropriate LangChain Loader is selected based on the file extension
  5. Intelligent chunking: SmartChunker is used preferentially, falling back to RecursiveCharacterTextSplitter
  6. Generate vectors: Uses the user-configured embedding model (create_embeddings(user_id))
  7. Store in pgvector: Each chunk is stored with metadata — user_id, knowledge_base_id, document_id, chunk_index, source, page, created_at
  8. Invalidate BM25: Notifies BM25Store that the index for this knowledge base needs to be rebuilt (auto-built on next retrieval)
  9. Update status: Document status is updated to completed via the Next.js API, with chunkCount recorded
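
The asynchronous hand-off in step 3 can be illustrated with a minimal FastAPI sketch; `process_document` below is only a placeholder for steps 4–9, and the router prefix is assumed to match the path shown above:

```python
from fastapi import APIRouter, BackgroundTasks

router = APIRouter()

def process_document(kb_id: str, doc_id: str) -> None:
    # Placeholder for steps 4-9: load, chunk, embed, store in pgvector,
    # invalidate the BM25 index, then report the final status and chunkCount.
    ...

@router.post("/{kb_id}/documents/{doc_id}/process")
async def trigger_processing(kb_id: str, doc_id: str, background_tasks: BackgroundTasks) -> dict:
    # The request returns immediately; the document moves to "processing" while the
    # background task runs, and to "completed" or "failed" when it finishes.
    background_tasks.add_task(process_document, kb_id, doc_id)
    return {"status": "processing"}
```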

Document Status

| Status | Description |
| --- | --- |
| pending | Uploaded, processing not yet started |
| processing | Currently being processed (async background task) |
| completed | Processing complete |
| failed | Processing failed (errorMessage recorded) |
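
As a small illustrative sketch (not the service's actual data model), the status values map naturally onto a string literal type:

```python
from typing import Literal

DocumentStatus = Literal["pending", "processing", "completed", "failed"]

def is_terminal(status: DocumentStatus) -> bool:
    # Only completed and failed documents keep their status permanently.
    return status in ("completed", "failed")
```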

SmartChunker Intelligent Chunking

SmartChunker draws inspiration from RAGFlow’s DeepDoc design, selecting the optimal chunking strategy based on document type to achieve semantically aware chunking rather than fixed-size splitting.

Chunking Strategies

| Document Type | Strategy | Description |
| --- | --- | --- |
| Markdown | MarkdownHeaderTextSplitter | Chunks by heading level (H1–H6), preserving hierarchy metadata |
| Code files | Language-aware splitter | Chunks by function/class syntax structure, supports 15+ languages |
| Tables | Table integrity preservation | Tables are kept as independent chunks to avoid being split |
| PDF / Other | RecursiveCharacterTextSplitter | Recursive character chunking + paragraph awareness |
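
The Markdown strategy corresponds to LangChain's MarkdownHeaderTextSplitter; the snippet below shows the general mechanism, with a header configuration chosen for illustration rather than taken from SmartChunker's actual settings:

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

# Split on H1-H3 headings; each chunk carries its heading hierarchy as metadata.
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
chunks = splitter.split_text(
    "# Guide\n\n## Setup\n\nInstall the package.\n\n## Usage\n\nRun the service."
)
for chunk in chunks:
    print(chunk.metadata, "->", chunk.page_content)
```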

Supported Code Languages

Python, JavaScript/TypeScript, Java, Go, C/C++, C#, Ruby, PHP, Rust, Scala, Swift, Kotlin, and more.
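
Language-aware splitting of this kind is available in LangChain via RecursiveCharacterTextSplitter.from_language; whether SmartChunker uses exactly this helper is not specified here, so treat the snippet as illustrative:

```python
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# Separators follow Python syntax (class/def boundaries) instead of plain characters.
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=60, chunk_overlap=0
)

sample = (
    "def add(a, b):\n"
    "    return a + b\n\n"
    "class Greeter:\n"
    "    def hello(self):\n"
    "        return 'hi'\n"
)
print(python_splitter.split_text(sample))  # splits at the function/class boundary
```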

Processing Configuration

Processing configuration can be customized at the knowledge base level, retrieved via GET /api/knowledge-base/{id}/config:
| Parameter | Default | Description |
| --- | --- | --- |
| chunkSize | 1000 | Maximum characters per chunk |
| chunkOverlap | 200 | Overlapping characters between adjacent chunks |
| separators | ["\n\n", "\n", "。", ".", "!", "?", ";", ";", " ", ""] | Recursive separators |
| indexMethod | high_quality | Indexing method |
| replaceConsecutiveSpaces | true | Replace consecutive spaces |
| deleteUrlsAndEmails | false | Remove URLs and email addresses |
| useSmartChunking | true | Whether to enable SmartChunker (when disabled, falls back to traditional chunking) |
If configuration retrieval fails, the default configuration is used automatically.
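
A fetch-with-fallback helper for this endpoint might look like the sketch below; the defaults mirror the table (separators omitted for brevity), while the base URL, the httpx client choice, and the function name are assumptions:

```python
import httpx

DEFAULT_CONFIG = {
    "chunkSize": 1000,
    "chunkOverlap": 200,
    "indexMethod": "high_quality",
    "replaceConsecutiveSpaces": True,
    "deleteUrlsAndEmails": False,
    "useSmartChunking": True,
}

def get_kb_config(kb_id: str, base_url: str = "http://localhost:3000") -> dict:
    """Fetch the knowledge base's processing config, falling back to defaults on any error."""
    try:
        resp = httpx.get(f"{base_url}/api/knowledge-base/{kb_id}/config", timeout=5.0)
        resp.raise_for_status()
        return {**DEFAULT_CONFIG, **resp.json()}
    except Exception:
        return DEFAULT_CONFIG
```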

Chunk Preview

A client-side preview endpoint is available for real-time visualization when configuring chunking parameters:
POST /api/knowledge-base/preview-chunks

```json
{
  "content": "Text content to preview...",
  "chunk_size": 1024,
  "chunk_overlap": 50,
  "delimiter": "\n\n"
}
```
Returns the chunking results, including index, content, and length for each chunk.
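
A quick client-side call could look like this; the backend port and the exact shape of the response (a chunks array with index, content, and length fields) are assumptions based on the description above:

```python
import httpx

payload = {
    "content": "First paragraph.\n\nSecond paragraph.\n\nThird paragraph.",
    "chunk_size": 1024,
    "chunk_overlap": 50,
    "delimiter": "\n\n",
}
resp = httpx.post("http://localhost:8000/api/knowledge-base/preview-chunks", json=payload)
for chunk in resp.json().get("chunks", []):
    print(chunk["index"], chunk["length"], chunk["content"][:40])
```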