DocumentService handles the complete document processing workflow: from upload to parsing, chunking, vectorization, and storage.

Supported File Formats

| Format | Extensions | Parser | Fallback |
| --- | --- | --- | --- |
| PDF | .pdf | PyPDFLoader | UnstructuredFileLoader |
| Word | .doc, .docx | UnstructuredWordDocumentLoader | UnstructuredFileLoader |
| Excel | .xls, .xlsx | UnstructuredExcelLoader (elements mode) | UnstructuredFileLoader |
| PowerPoint | .ppt, .pptx | UnstructuredPowerPointLoader | UnstructuredFileLoader |
| Markdown | .md, .markdown | UnstructuredMarkdownLoader | Plain text reading |
| CSV | .csv | CSVLoader | Plain text reading |
| Plain Text | .txt | Direct reading | |
Each format has a fallback strategy — if the specialized Loader fails or its dependencies are not installed, the system automatically falls back to a general-purpose Loader. Excel documents are loaded in elements mode with table content tagged, and PowerPoint slides are tagged by slide number.
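
As a rough sketch of how extension-based selection with a fallback could look (the loader classes are the real LangChain ones from the table above; the mapping and function name are illustrative assumptions, not the service's actual code):

```python
from pathlib import Path

from langchain_community.document_loaders import (
    PyPDFLoader,
    UnstructuredFileLoader,
    UnstructuredWordDocumentLoader,
)

# Illustrative mapping; the actual service registers one loader per supported format.
PRIMARY_LOADERS = {
    ".pdf": PyPDFLoader,
    ".doc": UnstructuredWordDocumentLoader,
    ".docx": UnstructuredWordDocumentLoader,
}

def load_with_fallback(file_path: str):
    """Try the format-specific loader first, then fall back to UnstructuredFileLoader."""
    suffix = Path(file_path).suffix.lower()
    loader_cls = PRIMARY_LOADERS.get(suffix, UnstructuredFileLoader)
    try:
        return loader_cls(file_path).load()
    except Exception:
        # Missing optional dependency or a parse error: use the general-purpose loader.
        return UnstructuredFileLoader(file_path).load()
```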

Processing Pipeline

Detailed Steps

  1. Upload file: Saved locally to data/documents/{user_id}/{kb_id}/{uuid}_{filename}, with a unique filename to prevent conflicts
  2. Create record: Document metadata (fileName, filePath, fileType, fileSize) is created via the Next.js API
  3. Trigger processing: Call POST /{kb_id}/documents/{doc_id}/process, executed asynchronously via BackgroundTasks
  4. Load document: The appropriate LangChain Loader is selected based on the file extension
  5. Intelligent chunking: SmartChunker is used preferentially, falling back to RecursiveCharacterTextSplitter
  6. Generate vectors: Uses the user-configured embedding model (create_embeddings(user_id))
  7. Store in pgvector: Each chunk is stored with metadata — user_id, knowledge_base_id, document_id, chunk_index, source, page, created_at
  8. Invalidate BM25: Notifies BM25Store that the index for this knowledge base needs to be rebuilt (auto-built on next retrieval)
  9. Update status: Document status is updated to completed via the Next.js API, with chunkCount recorded
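
The asynchronous hand-off in step 3 can be illustrated with a minimal FastAPI sketch; `process_document` below is only a placeholder for steps 4–9, and the router prefix is assumed to match the path shown above:

```python
from fastapi import APIRouter, BackgroundTasks

router = APIRouter()

def process_document(kb_id: str, doc_id: str) -> None:
    # Placeholder for steps 4-9: load, chunk, embed, store in pgvector,
    # invalidate the BM25 index, then report the final status and chunkCount.
    ...

@router.post("/{kb_id}/documents/{doc_id}/process")
async def trigger_processing(kb_id: str, doc_id: str, background_tasks: BackgroundTasks) -> dict:
    # The request returns immediately; the document moves to "processing" while the
    # background task runs, and to "completed" or "failed" when it finishes.
    background_tasks.add_task(process_document, kb_id, doc_id)
    return {"status": "processing"}
```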

Document Status

| Status | Description |
| --- | --- |
| pending | Uploaded, processing not yet started |
| processing | Currently being processed (async background task) |
| completed | Processing complete |
| failed | Processing failed (errorMessage recorded) |
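
As a small illustrative sketch (not the service's actual data model), the status values map naturally onto a string literal type:

```python
from typing import Literal

DocumentStatus = Literal["pending", "processing", "completed", "failed"]

def is_terminal(status: DocumentStatus) -> bool:
    # Only completed and failed documents keep their status permanently.
    return status in ("completed", "failed")
```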

SmartChunker Intelligent Chunking

SmartChunker draws inspiration from RAGFlow’s DeepDoc design, selecting the optimal chunking strategy based on document type to achieve semantically aware chunking rather than fixed-size splitting.

Chunking Strategies

| Document Type | Strategy | Description |
| --- | --- | --- |
| Markdown | MarkdownHeaderTextSplitter | Chunks by heading level (H1–H6), preserving hierarchy metadata |
| Code files | Language-aware splitter | Chunks by function/class syntax structure, supports 15+ languages |
| Tables | Table integrity preservation | Tables are kept as independent chunks to avoid being split |
| PDF / Other | RecursiveCharacterTextSplitter | Recursive character chunking + paragraph awareness |
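
The Markdown strategy corresponds to LangChain's MarkdownHeaderTextSplitter; the snippet below shows the general mechanism, with a header configuration chosen for illustration rather than taken from SmartChunker's actual settings:

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

# Split on H1-H3 headings; each chunk carries its heading hierarchy as metadata.
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
chunks = splitter.split_text(
    "# Guide\n\n## Setup\n\nInstall the package.\n\n## Usage\n\nRun the service."
)
for chunk in chunks:
    print(chunk.metadata, "->", chunk.page_content)
```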

Supported Code Languages

Python, JavaScript/TypeScript, Java, Go, C/C++, C#, Ruby, PHP, Rust, Scala, Swift, Kotlin, and more.
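
Language-aware splitting of this kind is available in LangChain via RecursiveCharacterTextSplitter.from_language; whether SmartChunker uses exactly this helper is not specified here, so treat the snippet as illustrative:

```python
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# Separators follow Python syntax (class/def boundaries) instead of plain characters.
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=60, chunk_overlap=0
)

sample = (
    "def add(a, b):\n"
    "    return a + b\n\n"
    "class Greeter:\n"
    "    def hello(self):\n"
    "        return 'hi'\n"
)
print(python_splitter.split_text(sample))  # splits at the function/class boundary
```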

Processing Configuration

Processing configuration can be customized at the knowledge base level, retrieved via GET /api/knowledge-base/{id}/config:
| Parameter | Default | Description |
| --- | --- | --- |
| chunkSize | 1000 | Maximum characters per chunk |
| chunkOverlap | 200 | Overlapping characters between adjacent chunks |
| separators | ["\n\n", "\n", "。", ".", "!", "?", ";", ";", " ", ""] | Recursive separators |
| indexMethod | high_quality | Indexing method |
| replaceConsecutiveSpaces | true | Replace consecutive spaces |
| deleteUrlsAndEmails | false | Remove URLs and email addresses |
| useSmartChunking | true | Whether to enable SmartChunker (when disabled, falls back to traditional chunking) |
If configuration retrieval fails, the default configuration is used automatically.
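
A fetch-with-fallback helper for this endpoint might look like the sketch below; the defaults mirror the table (separators omitted for brevity), while the base URL, the httpx client choice, and the function name are assumptions:

```python
import httpx

DEFAULT_CONFIG = {
    "chunkSize": 1000,
    "chunkOverlap": 200,
    "indexMethod": "high_quality",
    "replaceConsecutiveSpaces": True,
    "deleteUrlsAndEmails": False,
    "useSmartChunking": True,
}

def get_kb_config(kb_id: str, base_url: str = "http://localhost:3000") -> dict:
    """Fetch the knowledge base's processing config, falling back to defaults on any error."""
    try:
        resp = httpx.get(f"{base_url}/api/knowledge-base/{kb_id}/config", timeout=5.0)
        resp.raise_for_status()
        return {**DEFAULT_CONFIG, **resp.json()}
    except Exception:
        return DEFAULT_CONFIG
```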

Chunk Preview

A client-side preview endpoint is available for real-time visualization when configuring chunking parameters:
POST /api/knowledge-base/preview-chunks

```json
{
  "content": "Text content to preview...",
  "chunk_size": 1024,
  "chunk_overlap": 50,
  "delimiter": "\n\n"
}
```
Returns the chunking results, including index, content, and length for each chunk.
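
A quick client-side call could look like this; the backend port and the exact shape of the response (a chunks array with index, content, and length fields) are assumptions based on the description above:

```python
import httpx

payload = {
    "content": "First paragraph.\n\nSecond paragraph.\n\nThird paragraph.",
    "chunk_size": 1024,
    "chunk_overlap": 50,
    "delimiter": "\n\n",
}
resp = httpx.post("http://localhost:8000/api/knowledge-base/preview-chunks", json=payload)
for chunk in resp.json().get("chunks", []):
    print(chunk["index"], chunk["length"], chunk["content"][:40])
```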