Document Pipeline Cluster - KB Documentation

1. Overview

The Document Pipeline handles all PDF extraction, creation, and storage operations in Nexus. It consists of four components working together.

Port Assignments (6650-6653)

| Port | Component | Purpose |
|------|-----------|---------|
| 6650 | Corpus Redis Vault | Persistent document data storage |
| 6651 | Corpus Redis Operational | Read-replica for queries |
| 6652 | PDF-Converter MCP | Stateless extraction/creation |
| 6653 | Nexus Docs (CopyParty) | File CDN storage |
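
As a quick health check for the assignments above, the following minimal sketch probes each port with a plain TCP connection. The host `127.0.0.1` and the helper names `port_is_open` and `check_pipeline` are assumptions for illustration, not part of the Nexus tooling. Because PDF-Converter is described as a stateless MCP server, port 6652 may legitimately report as closed.

```python
import socket

# Port assignments copied from the table above.
PIPELINE_PORTS = {
    6650: "Corpus Redis Vault",
    6651: "Corpus Redis Operational",
    6652: "PDF-Converter MCP",
    6653: "Nexus Docs (CopyParty)",
}


def port_is_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_pipeline(host: str = "127.0.0.1") -> dict:
    """Map each component name to whether its port accepts connections."""
    # Assumption: all components listen on the same host.
    return {name: port_is_open(host, port) for port, name in PIPELINE_PORTS.items()}


if __name__ == "__main__":
    for name, is_up in check_pipeline().items():
        print(f"{name}: {'up' if is_up else 'down'}")
```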

Architecture

┌──────────────────────────────────────────────────────────────────┐
│                    DOCUMENT PIPELINE CLUSTER                     │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────────┐    ┌───────────────┐    ┌──────────────────┐   │
│  │ Nexus Docs   │    │ PDF-Converter │    │     Corpus       │   │
│  │ (CopyParty)  │───▶│  MCP Server   │───▶│   Redis Store    │   │
│  │ Port: 6653   │    │ Port: 6652*   │    │ Ports: 6650/6651 │   │
│  └──────────────┘    └───────────────┘    └──────────────────┘   │
│        │                    │                     ▲              │
│        │                    │                     │              │
│        ▼                    ▼                     │              │
│  ┌──────────────┐    ┌──────────────┐             │              │
│  │ /data/cdn/   │    │ nlm-ingestor │─────────────┘              │
│  │ users/{user} │    │ Port: 6752   │                            │
│  └──────────────┘    └──────────────┘                            │
│                                                                  │
│  * PDF-Converter is stateless MCP - no persistent port needed    │
└──────────────────────────────────────────────────────────────────┘

Data Flow

Extraction Flow (PDF → Data):

User File (CDN) → PDF-Converter → LLMSherpa → Structured Data → Corpus

Creation Flow (Data → PDF):

AI Content → PDF-Converter → Staging → User Approval → Nexus Docs (CDN)

2. PDF-Converter Server

Location: /opt/mcp-servers/pdf-converter/mcp_pdf_converter_server.py
Type: Stateless MCP (no Redis dependency)
Tools: 7

Tool Reference

convert

Extract PDF content and store in Corpus.

pdf-converter.convert(
    source: str,           # File path to PDF
    category: str,         # Category: book, manual, research, report, etc.
    title: str = None,     # Document title (auto-detected if omitted)
    engine: str = "llmsherpa",  # Extraction engine
    extract_images: bool = True,
    send_to_corpus: bool = True,
    chapter_max_level: int = 2,
    section_max_level: int = 4
)

Returns: Document ID, Corpus stable ID, extraction stats

analyze

Preview PDF structure without extracting.

pdf-converter.analyze(
    source: str    # File path to PDF
)

Returns: Section levels, sample content, recommended thresholds

batch

Process multiple PDFs from a folder.

pdf-converter.batch(
    folder: str,        # Folder containing PDFs
    category: str,      # Category for all documents
    pattern: str = "*.pdf",
    engine: str = "llmsherpa"
)
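
The source does not document what `batch` returns or how it handles a failing file. As a rough sketch of the idea only, the loop below applies a caller-supplied `convert` callable to every matching file and collects successes and failures separately. The function name `batch_convert`, the injected `convert` parameter, and the `processed`/`failed` summary shape are all hypothetical.

```python
from pathlib import Path
from typing import Callable


def batch_convert(folder: str, category: str, convert: Callable[..., dict],
                  pattern: str = "*.pdf", engine: str = "llmsherpa") -> dict:
    """Run `convert` on every path in `folder` that matches `pattern`."""
    # Hypothetical summary shape; the real tool's return value is not documented.
    summary = {"processed": [], "failed": []}
    for pdf in sorted(Path(folder).glob(pattern)):
        try:
            result = convert(source=str(pdf), category=category, engine=engine)
            summary["processed"].append(result)
        except Exception as exc:
            # Keep going so one unreadable PDF does not stop the whole batch.
            summary["failed"].append({"source": str(pdf), "error": str(exc)})
    return summary
```

Passing the converter in as an argument keeps the sketch independent of how the MCP tool is actually invoked.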

status

Check extraction engine availability.

pdf-converter.status()

Returns: Engine status (llmsherpa, pymupdf, marker)

find_user_file

Find files in a user's CDN folder.

pdf-converter.find_user_file(
    username: str,     # Username (e.g., "chris")
    pattern: str,      # Glob pattern (e.g., "*.pdf")
    folder: str = None # Optional subfolder
)

Returns: Matching files with paths, sizes, modification times
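
To illustrate the lookup, here is a minimal sketch of a glob-based search under `/data/cdn/users/{username}/`. It is not the server's implementation: the `root` parameter, the result field names (`path`, `size`, `modified`), and the non-recursive matching are assumptions.

```python
from pathlib import Path
from typing import Optional

CDN_ROOT = Path("/data/cdn/users")  # user file storage path from this document


def find_user_file(username: str, pattern: str, folder: Optional[str] = None,
                   root: Path = CDN_ROOT) -> list:
    """List files in the user's CDN folder that match a glob pattern."""
    base = root / username
    if folder:
        base = base / folder
    if not base.is_dir():
        return []
    matches = []
    for path in sorted(base.glob(pattern)):
        if path.is_file():
            info = path.stat()
            matches.append({
                "path": str(path),
                "size": info.st_size,        # bytes
                "modified": info.st_mtime,   # seconds since the epoch
            })
    return matches
```

A production version would also need to reject `folder` values such as `../other_user` that escape the user's directory.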

create_pdf

Create PDF from markdown and stage for user approval.

pdf-converter.create_pdf(
    title: str,        # PDF title (REQUIRED)
    content: str,      # Markdown content (REQUIRED)
    user_id: str,      # User ID - e.g., "u_z1p5" (REQUIRED)
    category: str = "report"
)

Returns: Staging path, filename, next step instructions

move_to_docs

Move staged PDF to permanent Nexus Docs location.

pdf-converter.move_to_docs(
    staging_path: str,  # Path from create_pdf
    username: str,      # Target username (e.g., "chris")
    folder: str = "documents"
)

Returns: Final path and CDN URL
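
The relationship between the final path and the CDN URL can be sketched as follows. The mapping is inferred only from the Workflow B example (`/home/{username}/{folder}/{filename}` under `https://docs.corlera.com`) and the `/data/cdn/users/{username}/` storage path. The helper name `plan_move` and its return shape are hypothetical, and the sketch only computes the destination without moving any file.

```python
from pathlib import PurePosixPath
from urllib.parse import quote

CDN_ROOT = PurePosixPath("/data/cdn/users")
CDN_BASE_URL = "https://docs.corlera.com/home"  # inferred from the Workflow B example


def plan_move(staging_path: str, username: str, folder: str = "documents") -> dict:
    """Compute the destination path and URL for a staged PDF (no file is moved)."""
    filename = PurePosixPath(staging_path).name
    final_path = CDN_ROOT / username / folder / filename
    # Percent-encode the filename so spaces or symbols stay valid in the URL.
    url = f"{CDN_BASE_URL}/{username}/{folder}/{quote(filename)}"
    return {"final_path": str(final_path), "url": url}
```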

Extraction Engines

| Engine | Status | Use Case |
|--------|--------|----------|
| LLMSherpa | ✅ Active | Hierarchical structure, tables, sections |
| PyMuPDF | ✅ Active | Image extraction |
| Marker | ⏳ Planned | Scientific docs with equations |

Staging Architecture

All AI-generated PDFs go to staging first:

/data/staging/
└── pdf/
    └── {user_id}/
        └── {Title}_{timestamp}.pdf

Why staging?

  • Multi-user isolation by user_id
  • AI creates, user confirms destination
  • Prevents scattered documents across locations
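
The staging layout can be sketched as a small path builder. The title sanitization (spaces to underscores, other unsafe characters dropped) and the date-only `YYYYMMDD` timestamp are assumptions chosen to reproduce the Workflow B example path; the server may use a different rule or a finer-grained timestamp.

```python
from datetime import datetime
from pathlib import PurePosixPath

STAGING_ROOT = PurePosixPath("/data/staging/pdf")


def staging_path(user_id: str, title: str, now: datetime) -> str:
    """Build /data/staging/pdf/{user_id}/{Title}_{timestamp}.pdf."""
    # Assumption: spaces become underscores; anything else unsafe is dropped.
    safe_title = "".join(
        ch for ch in title.replace(" ", "_") if ch.isalnum() or ch in "_-"
    )
    # Assumption: date-only stamp, as in the Workflow B example.
    stamp = now.strftime("%Y%m%d")
    return str(STAGING_ROOT / user_id / f"{safe_title}_{stamp}.pdf")


print(staging_path("u_z1p5", "Q4 Research Report", datetime(2026, 1, 7)))
# prints /data/staging/pdf/u_z1p5/Q4_Research_Report_20260107.pdf
```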


3. Corpus Integration

PDF-Converter writes extracted content directly to the Corpus Redis vault (port 6650).

Connection Details

CORPUS_VAULT_PORT = 6650
CORPUS_PASSWORD = "[from locker]"
CORPUS_PREFIX = "corp"

Key Format

corp:{user_id}:{raw_id}

Example: corp:u_z1p5:20260107_1106_9be89c3e
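
A minimal sketch of building and splitting keys in this format. The helper names `make_key` and `parse_key` are hypothetical.

```python
CORPUS_PREFIX = "corp"


def make_key(user_id: str, raw_id: str, prefix: str = CORPUS_PREFIX) -> str:
    """Build a document key of the form corp:{user_id}:{raw_id}."""
    return f"{prefix}:{user_id}:{raw_id}"


def parse_key(key: str, prefix: str = CORPUS_PREFIX) -> tuple:
    """Split a document key into (user_id, raw_id); raise ValueError if malformed."""
    parts = key.split(":")
    if len(parts) != 3 or parts[0] != prefix:
        raise ValueError(f"not a document key: {key!r}")
    return parts[1], parts[2]


print(make_key("u_z1p5", "20260107_1106_9be89c3e"))
# prints corp:u_z1p5:20260107_1106_9be89c3e
```

Note that index keys such as `corp:category:research` share the prefix and also have three parts, so a caller scanning the keyspace would need an extra check (for example on the `u_` user ID prefix seen in the example) to tell them apart.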

Document Schema

{
    "id": "20260107_1106_9be89c3e",
    "stable_id": "c_9c3e",
    "title": "Document Title",
    "content": "Extracted text content...",
    "category": "research",
    "source_path": "/data/cdn/users/chris/documents/file.pdf",
    "extraction_engine": "llmsherpa",
    "tags": ["extracted", "engine:llmsherpa", "research"],
    "metadata": {
        "sections": 12,
        "tables": 3,
        "chunks": 45,
        "images": 8,
        "total_chars": 25000
    },
    "created_at": "2026-01-07T11:06:51",
    "user": "u_z1p5"
}
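
A lightweight check of a document against this schema might look like the sketch below. Treating every field as required and parsing `created_at` as ISO 8601 are assumptions based on the single example above; `validate_document` is a hypothetical helper.

```python
from datetime import datetime

# Field names and types taken from the example document above.
REQUIRED_FIELDS = {
    "id": str, "stable_id": str, "title": str, "content": str,
    "category": str, "source_path": str, "extraction_engine": str,
    "tags": list, "metadata": dict, "created_at": str, "user": str,
}


def validate_document(doc: dict) -> list:
    """Return a list of problems; an empty list means the document looks valid."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in doc:
            problems.append(f"missing field: {field}")
        elif not isinstance(doc[field], expected):
            problems.append(f"{field}: expected {expected.__name__}")
    if isinstance(doc.get("created_at"), str):
        try:
            datetime.fromisoformat(doc["created_at"])
        except ValueError:
            problems.append("created_at: not an ISO 8601 timestamp")
    return problems
```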

Indexes Created

| Index | Purpose |
|-------|---------|
| corp:id_index | stable_id → raw_id mapping |
| corp:stable_index | raw_id → stable_id mapping |
| corp:category:{category} | Documents by category |
| corp:user:{user_id}:docs | Documents by user |
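
To show how the four indexes relate to a stored document, the sketch below maintains them in plain Python dictionaries and sets. The source does not say which Redis data types back each index, so the hash-like and set-like stand-ins here are assumptions, as are the names `FakeIndexStore`, `index_document`, and `resolve_stable_id`.

```python
from collections import defaultdict


class FakeIndexStore:
    """In-memory stand-in for the Corpus Redis indexes (illustration only)."""

    def __init__(self):
        self.maps = defaultdict(dict)  # assumed hash-like: field -> value
        self.sets = defaultdict(set)   # assumed set-like: members


def index_document(store: FakeIndexStore, doc: dict, prefix: str = "corp") -> None:
    """Record one document in all four indexes."""
    raw_id, stable_id = doc["id"], doc["stable_id"]
    store.maps[f"{prefix}:id_index"][stable_id] = raw_id
    store.maps[f"{prefix}:stable_index"][raw_id] = stable_id
    store.sets[f"{prefix}:category:{doc['category']}"].add(raw_id)
    store.sets[f"{prefix}:user:{doc['user']}:docs"].add(raw_id)


def resolve_stable_id(store: FakeIndexStore, stable_id: str, prefix: str = "corp"):
    """Look up the raw_id for a stable_id, or None if unknown."""
    return store.maps[f"{prefix}:id_index"].get(stable_id)
```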

4. Example Workflows

Workflow A: Extract User's PDF

# 1. Find the file
result = pdf-converter.find_user_file(
    username="chris",
    pattern="*.pdf",
    folder="uploads"
)
# Returns: /data/cdn/users/chris/uploads/contract.pdf

# 2. Analyze structure (optional)
analysis = pdf-converter.analyze(
    source="/data/cdn/users/chris/uploads/contract.pdf"
)
# Returns: Recommended thresholds

# 3. Extract and store in Corpus
extracted = pdf-converter.convert(
    source="/data/cdn/users/chris/uploads/contract.pdf",
    category="contract",
    send_to_corpus=True
)
# Returns: corpus_id, stable_id, extraction stats

Workflow B: Create Research Report

# 1. Create content (AI generates this)
content = """
# Research Findings

## Summary
Analysis of Q4 sales data reveals...

## Key Metrics
- Revenue: $2.4M
- Growth: 15%
"""

# 2. Create PDF (goes to staging)
result = pdf-converter.create_pdf(
    title="Q4 Research Report",
    content=content,
    user_id="u_z1p5",
    category="report"
)
# Returns: staging_path = /data/staging/pdf/u_z1p5/Q4_Research_Report_20260107.pdf

# 3. Ask user where to put it
# AI: "I created your Q4 Research Report. Where would you like me to save it?"
# User: "Put it in my reports folder"

# 4. Move to permanent location
final = pdf-converter.move_to_docs(
    staging_path="/data/staging/pdf/u_z1p5/Q4_Research_Report_20260107.pdf",
    username="chris",
    folder="reports"
)
# Returns: URL = https://docs.corlera.com/home/chris/reports/Q4_Research_Report_20260107.pdf

Workflow C: Batch Process Uploads

# Process all new PDFs in uploads folder
pdf-converter.batch(
    folder="/data/cdn/users/chris/uploads",
    category="uncategorized",
    pattern="*.pdf"
)
# All PDFs extracted and stored in Corpus

5. Configuration Files

PDF-Converter

  • Main: /opt/mcp-servers/pdf-converter/mcp_pdf_converter_server.py
  • Backup: /opt/mcp-servers/pdf-converter/mcp_pdf_converter_server.py.bak
  • nlm-ingestor: Docker container on port 6752
  • Corpus: /opt/mcp-servers/corpus/mcp_corpus_server.py
  • CopyParty: Docker container nexus-cdn on port 6653

Storage Paths

| Path | Purpose |
|------|---------|
| /data/cdn/users/{username}/ | User file storage (CDN) |
| /data/staging/pdf/{user_id}/ | Temporary PDF staging |
| /data/nexus3/documents/vault/ | Corpus Redis data |

Documentation created by Agent Indiana (a_jh9b) | KB node by Rocky (o_jugt) | 2026-01-07

ID: 8956de76
Updated: 2026-01-07T12:07:02