Document Pipeline Cluster - KB Documentation
1. Overview
The Document Pipeline handles all PDF extraction, creation, and storage operations in Nexus. It consists of four components working together.
Port Assignments (6650-6653)
| Port | Component | Purpose |
|---|---|---|
| 6650 | Corpus Redis Vault | Persistent document data storage |
| 6651 | Corpus Redis Operational | Read-replica for queries |
| 6652 | PDF-Converter MCP | Stateless extraction/creation |
| 6653 | Nexus Docs (CopyParty) | File CDN storage |
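The assignments above can be captured as a small lookup with a reachability probe. This is only a sketch: the default host `127.0.0.1` and the helper names `is_listening` and `report` are assumptions for illustration, not part of the Nexus codebase.

```python
import socket

# Port assignments copied from the table above.
PIPELINE_PORTS = {
    6650: "Corpus Redis Vault",
    6651: "Corpus Redis Operational",
    6652: "PDF-Converter MCP",
    6653: "Nexus Docs (CopyParty)",
}


def is_listening(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds (hypothetical helper)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def report(host: str = "127.0.0.1") -> dict:
    """Map each component name to whether its port accepts connections."""
    return {name: is_listening(port, host) for port, name in PIPELINE_PORTS.items()}
```

A closed port 6652 is not necessarily a fault, since the architecture diagram notes that PDF-Converter is a stateless MCP that needs no persistent port.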
Architecture
┌─────────────────────────────────────────────────────────────────┐
│                    DOCUMENT PIPELINE CLUSTER                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────────┐    ┌───────────────┐    ┌──────────────────┐  │
│  │  Nexus Docs  │    │ PDF-Converter │    │      Corpus      │  │
│  │ (CopyParty)  │───▶│  MCP Server   │───▶│   Redis Store    │  │
│  │  Port: 6653  │    │  Port: 6652*  │    │ Ports: 6650/6651 │  │
│  └──────────────┘    └───────────────┘    └──────────────────┘  │
│         │                    │                     ▲            │
│         │                    │                     │            │
│         ▼                    ▼                     │            │
│  ┌──────────────┐    ┌───────────────┐             │            │
│  │  /data/cdn/  │    │ nlm-ingestor  │─────────────┘            │
│  │ users/{user} │    │  Port: 6752   │                          │
│  └──────────────┘    └───────────────┘                          │
│                                                                 │
│  * PDF-Converter is stateless MCP - no persistent port needed   │
└─────────────────────────────────────────────────────────────────┘
Data Flow
Extraction Flow (PDF → Data):
User File (CDN) → PDF-Converter → LLMSherpa → Structured Data → Corpus
Creation Flow (Data → PDF):
AI Content → PDF-Converter → Staging → User Approval → Nexus Docs (CDN)
2. PDF-Converter Server
Location: /opt/mcp-servers/pdf-converter/mcp_pdf_converter_server.py
Type: Stateless MCP (no Redis dependency)
Tools: 7
Tool Reference
convert
Extract PDF content and store in Corpus.
pdf-converter.convert(
source: str, # File path to PDF
category: str, # Category: book, manual, research, report, etc.
title: str = None, # Document title (auto-detected if omitted)
engine: str = "llmsherpa", # Extraction engine
extract_images: bool = True,
send_to_corpus: bool = True,
chapter_max_level: int = 2,
section_max_level: int = 4
)
Returns: Document ID, Corpus stable ID, extraction stats
analyze
Preview PDF structure without extracting.
pdf-converter.analyze(
source: str # File path to PDF
)
Returns: Section levels, sample content, recommended thresholds
batch
Process multiple PDFs from a folder.
pdf-converter.batch(
folder: str, # Folder containing PDFs
category: str, # Category for all documents
pattern: str = "*.pdf",
engine: str = "llmsherpa"
)
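The source does not say what batch returns or how it treats a file that fails to extract. The sketch below shows one plausible shape, assuming failures are recorded and the run continues. The `convert` callable is injected here purely for illustration, and the `succeeded`/`failed` result keys are hypothetical.

```python
from pathlib import Path
from typing import Callable


def batch_convert(folder: str, category: str, convert: Callable[..., dict],
                  pattern: str = "*.pdf", engine: str = "llmsherpa") -> dict:
    """Run `convert` on every matching file, recording failures instead of aborting."""
    results = {"succeeded": [], "failed": []}
    for pdf in sorted(Path(folder).glob(pattern)):
        if not pdf.is_file():
            continue
        try:
            outcome = convert(source=str(pdf), category=category, engine=engine)
            results["succeeded"].append({"source": str(pdf), "result": outcome})
        except Exception as exc:
            # Keep going so that one unreadable PDF does not stop the batch.
            results["failed"].append({"source": str(pdf), "error": str(exc)})
    return results
```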
status
Check extraction engine availability.
pdf-converter.status()
Returns: Engine status (llmsherpa, pymupdf, marker)
find_user_file
Find files in user's CDN folder.
pdf-converter.find_user_file(
username: str, # Username (e.g., "chris")
pattern: str, # Glob pattern (e.g., "*.pdf")
folder: str = None # Optional subfolder
)
Returns: Matching files with paths, sizes, modification times
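A lookup of this kind can be sketched with `pathlib`. This is an illustration rather than the server's actual code: it assumes the search is non-recursive, that user folders live under `/data/cdn/users/` (as listed in Storage Paths), and that results carry the keys `path`, `size`, and `modified`. The `root` parameter is a hypothetical addition that makes the function testable.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

CDN_ROOT = Path("/data/cdn/users")  # user storage root from the Storage Paths table


def find_user_file(username: str, pattern: str, folder: Optional[str] = None,
                   root: Path = CDN_ROOT) -> list:
    """List regular files in the user's CDN folder that match a glob pattern."""
    base = root / username
    if folder:
        base = base / folder
    if not base.is_dir():
        return []
    matches = []
    for path in sorted(base.glob(pattern)):
        if path.is_file():
            info = path.stat()
            modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
            matches.append({
                "path": str(path),
                "size": info.st_size,                # bytes
                "modified": modified.isoformat(),    # UTC, ISO 8601
            })
    return matches
```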
create_pdf
Create PDF from markdown and stage for user approval.
pdf-converter.create_pdf(
title: str, # PDF title (REQUIRED)
content: str, # Markdown content (REQUIRED)
user_id: str, # User ID - e.g., "u_z1p5" (REQUIRED)
category: str = "report"
)
Returns: Staging path, filename, next step instructions
move_to_docs
Move staged PDF to permanent Nexus Docs location.
pdf-converter.move_to_docs(
staging_path: str, # Path from create_pdf
username: str, # Target username (e.g., "chris")
folder: str = "documents"
)
Returns: Final path and CDN URL
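The move step can be sketched as follows. Several details are assumptions: the destination layout `/data/cdn/users/{username}/{folder}/` is taken from Storage Paths, the URL prefix is inferred from the single example URL in Workflow B, and refusing to overwrite an existing file is a design choice made here, not documented behavior. The `cdn_root` parameter is hypothetical.

```python
import shutil
from pathlib import Path
from urllib.parse import quote

CDN_ROOT = Path("/data/cdn/users")               # from the Storage Paths table
DOCS_BASE_URL = "https://docs.corlera.com/home"  # inferred from the Workflow B example


def move_to_docs(staging_path: str, username: str, folder: str = "documents",
                 cdn_root: Path = CDN_ROOT) -> dict:
    """Move a staged PDF into the user's CDN folder and build its URL."""
    src = Path(staging_path)
    if not src.is_file():
        raise FileNotFoundError(f"staged PDF not found: {src}")
    dest_dir = cdn_root / username / folder
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    if dest.exists():
        # Assumed policy: never silently replace a user's existing document.
        raise FileExistsError(f"refusing to overwrite: {dest}")
    shutil.move(str(src), str(dest))
    url = f"{DOCS_BASE_URL}/{quote(username)}/{quote(folder)}/{quote(src.name)}"
    return {"path": str(dest), "url": url}
```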
Extraction Engines
| Engine | Status | Use Case |
|---|---|---|
| LLMSherpa | ✅ Active | Hierarchical structure, tables, sections |
| PyMuPDF | ✅ Active | Image extraction |
| Marker | ⏳ Planned | Scientific docs with equations |
Staging Architecture
All AI-generated PDFs go to staging first:
/data/staging/
└── pdf/
    └── {user_id}/
        └── {Title}_{timestamp}.pdf
Why staging?
- Multi-user isolation by user_id
- AI creates, user confirms destination
- Prevents scattered documents across locations
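The staging layout can be sketched as a path builder. The naming details are assumptions drawn from the one example in Workflow B (`Q4_Research_Report_20260107.pdf`): spaces become underscores and the timestamp is a date in `YYYYMMDD` form. The user ID check is a hypothetical guard that keeps one user's files from landing in another user's folder.

```python
import re
from datetime import datetime
from pathlib import Path
from typing import Optional

STAGING_ROOT = Path("/data/staging/pdf")


def staging_path(user_id: str, title: str, now: Optional[datetime] = None) -> Path:
    """Build /data/staging/pdf/{user_id}/{Title}_{timestamp}.pdf for a new PDF."""
    # Assumed ID format: letters, digits, underscores (e.g. "u_z1p5").
    if not re.fullmatch(r"[A-Za-z0-9_]+", user_id):
        raise ValueError(f"unsafe user_id: {user_id!r}")
    now = now or datetime.now()
    # Collapse every run of non-alphanumeric characters into one underscore.
    safe_title = re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_") or "untitled"
    stamp = now.strftime("%Y%m%d")  # assumed timestamp format
    return STAGING_ROOT / user_id / f"{safe_title}_{stamp}.pdf"
```

For example, `staging_path("u_z1p5", "Q4 Research Report", datetime(2026, 1, 7))` yields `/data/staging/pdf/u_z1p5/Q4_Research_Report_20260107.pdf`.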
3. Corpus Integration
PDF-Converter writes extracted content directly to the Corpus Redis Vault (port 6650).
Connection Details
CORPUS_VAULT_PORT = 6650
CORPUS_PASSWORD = "[from locker]"
CORPUS_PREFIX = "corp"
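These settings can be turned into client connection arguments along the following lines. This is a sketch under assumptions: the password is read from a `CORPUS_PASSWORD` environment variable as a stand-in for the locker lookup, and the host defaults to `127.0.0.1` unless `CORPUS_HOST` is set. Neither variable name comes from the source.

```python
import os

CORPUS_VAULT_PORT = 6650
CORPUS_PREFIX = "corp"


def corpus_connection_kwargs(env=None) -> dict:
    """Build keyword arguments for a Redis client pointed at the Corpus vault."""
    env = os.environ if env is None else env
    password = env.get("CORPUS_PASSWORD")  # hypothetical stand-in for the locker
    if not password:
        raise RuntimeError("CORPUS_PASSWORD is not set")
    return {
        "host": env.get("CORPUS_HOST", "127.0.0.1"),  # assumed default host
        "port": CORPUS_VAULT_PORT,
        "password": password,
        "decode_responses": True,  # return str instead of bytes
    }
```

If the redis-py package is installed, the result can be passed straight to the client, as in `redis.Redis(**corpus_connection_kwargs())`.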
Key Format
corp:{user_id}:{raw_id}
Example: corp:u_z1p5:20260107_1106_9be89c3e
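The key format can be sketched as a pair of helpers. `make_key` and `parse_key` are hypothetical names. The reserved-word check is an assumption added here because index keys such as `corp:category:{category}` have the same three-part shape as a document key.

```python
CORPUS_PREFIX = "corp"
# Second segments used by index keys rather than user IDs (see Indexes Created).
RESERVED_SEGMENTS = {"category", "user"}


def make_key(user_id: str, raw_id: str) -> str:
    """Build a document key of the form corp:{user_id}:{raw_id}."""
    for part in (user_id, raw_id):
        if not part or ":" in part:
            raise ValueError(f"invalid key segment: {part!r}")
    return f"{CORPUS_PREFIX}:{user_id}:{raw_id}"


def parse_key(key: str) -> tuple:
    """Split a document key into (user_id, raw_id), rejecting index keys."""
    parts = key.split(":")
    if len(parts) != 3 or parts[0] != CORPUS_PREFIX:
        raise ValueError(f"not a Corpus document key: {key!r}")
    if parts[1] in RESERVED_SEGMENTS:
        raise ValueError(f"index key, not a document key: {key!r}")
    return parts[1], parts[2]
```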
Document Schema
{
"id": "20260107_1106_9be89c3e",
"stable_id": "c_9c3e",
"title": "Document Title",
"content": "Extracted text content...",
"category": "research",
"source_path": "/data/cdn/users/chris/documents/file.pdf",
"extraction_engine": "llmsherpa",
"tags": ["extracted", "engine:llmsherpa", "research"],
"metadata": {
"sections": 12,
"tables": 3,
"chunks": 45,
"images": 8,
"total_chars": 25000
},
"created_at": "2026-01-07T11:06:51",
"user": "u_z1p5"
}
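A document of this shape can be checked before it is written. The sketch below assumes every field in the example is required and that `created_at` is an ISO 8601 string. The source shows only one sample document, so treat both as assumptions. `validate_document` is a hypothetical helper.

```python
from datetime import datetime

# Field names and types as they appear in the example document.
REQUIRED_FIELDS = {
    "id": str, "stable_id": str, "title": str, "content": str,
    "category": str, "source_path": str, "extraction_engine": str,
    "tags": list, "metadata": dict, "created_at": str, "user": str,
}


def validate_document(doc: dict) -> list:
    """Return a list of problems; an empty list means the document looks valid."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in doc:
            problems.append(f"missing field: {name}")
        elif not isinstance(doc[name], expected):
            problems.append(f"{name}: expected {expected.__name__}")
    created = doc.get("created_at")
    if isinstance(created, str):
        try:
            datetime.fromisoformat(created)
        except ValueError:
            problems.append("created_at: not an ISO 8601 timestamp")
    return problems
```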
Indexes Created
| Index | Purpose |
|---|---|
| corp:id_index | stable_id → raw_id mapping |
| corp:stable_index | raw_id → stable_id mapping |
| corp:category:{category} | Documents by category |
| corp:user:{user_id}:docs | Documents by user |
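The index updates for one document can be sketched as the list of Redis commands they would require. The Redis data types are assumptions, since the source does not state them: the two mapping indexes are modeled as hashes (HSET) and the category and user indexes as sets (SADD). `index_commands` is a hypothetical helper that only builds the commands and sends nothing.

```python
CORPUS_PREFIX = "corp"


def index_commands(doc: dict) -> list:
    """Return the (command, key, *args) tuples needed to index one document."""
    raw_id = doc["id"]
    stable_id = doc["stable_id"]
    return [
        # stable_id -> raw_id, as described for corp:id_index
        ("HSET", f"{CORPUS_PREFIX}:id_index", stable_id, raw_id),
        # raw_id -> stable_id, as described for corp:stable_index
        ("HSET", f"{CORPUS_PREFIX}:stable_index", raw_id, stable_id),
        # membership sets for category and user lookups (assumed type)
        ("SADD", f"{CORPUS_PREFIX}:category:{doc['category']}", raw_id),
        ("SADD", f"{CORPUS_PREFIX}:user:{doc['user']}:docs", raw_id),
    ]
```

Building the commands as data keeps the sketch runnable without a Redis server. With a real client they could be sent together in a pipeline so the four indexes stay consistent.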
4. Example Workflows
Workflow A: Extract User's PDF
# 1. Find the file
result = pdf-converter.find_user_file(
username="chris",
pattern="*.pdf",
folder="uploads"
)
# Returns: /data/cdn/users/chris/uploads/contract.pdf
# 2. Analyze structure (optional)
analysis = pdf-converter.analyze(
source="/data/cdn/users/chris/uploads/contract.pdf"
)
# Returns: Recommended thresholds
# 3. Extract and store in Corpus
extracted = pdf-converter.convert(
source="/data/cdn/users/chris/uploads/contract.pdf",
category="contract",
send_to_corpus=True
)
# Returns: corpus_id, stable_id, extraction stats
Workflow B: Create Research Report
# 1. Create content (AI generates this)
content = """
# Research Findings
## Summary
Analysis of Q4 sales data reveals...
## Key Metrics
- Revenue: $2.4M
- Growth: 15%
"""
# 2. Create PDF (goes to staging)
result = pdf-converter.create_pdf(
title="Q4 Research Report",
content=content,
user_id="u_z1p5",
category="report"
)
# Returns: staging_path = /data/staging/pdf/u_z1p5/Q4_Research_Report_20260107.pdf
# 3. Ask user where to put it
# AI: "I created your Q4 Research Report. Where would you like me to save it?"
# User: "Put it in my reports folder"
# 4. Move to permanent location
final = pdf-converter.move_to_docs(
staging_path="/data/staging/pdf/u_z1p5/Q4_Research_Report_20260107.pdf",
username="chris",
folder="reports"
)
# Returns: URL = https://docs.corlera.com/home/chris/reports/Q4_Research_Report_20260107.pdf
Workflow C: Batch Process Uploads
# Process all new PDFs in uploads folder
pdf-converter.batch(
folder="/data/cdn/users/chris/uploads",
category="uncategorized",
pattern="*.pdf"
)
# All PDFs extracted and stored in Corpus
5. Configuration Files
PDF-Converter
- Main: /opt/mcp-servers/pdf-converter/mcp_pdf_converter_server.py
- Backup: /opt/mcp-servers/pdf-converter/mcp_pdf_converter_server.py.bak
Related Services
- nlm-ingestor: Docker container on port 6752
- Corpus: /opt/mcp-servers/corpus/mcp_corpus_server.py
- CopyParty: Docker container nexus-cdn on port 6653
Storage Paths
| Path | Purpose |
|---|---|
| /data/cdn/users/{username}/ | User file storage (CDN) |
| /data/staging/pdf/{user_id}/ | Temporary PDF staging |
| /data/nexus3/documents/vault/ | Corpus Redis data |
Documentation created by Agent Indiana (a_jh9b) | KB node by Rocky (o_jugt) | 2026-01-07