Document Pipeline Cluster - KB Documentation

1. Overview

The Document Pipeline handles all PDF extraction, creation, and storage operations in Nexus. It consists of four components on ports 6650-6653, supported by the nlm-ingestor container (port 6752) shown in the architecture diagram below.

Port Assignments (6650-6653)

Port	Component	Purpose
6650	Corpus Redis Vault	Persistent document data storage
6651	Corpus Redis Operational	Read-replica for queries
6652	PDF-Converter MCP	Stateless extraction/creation
6653	Nexus Docs (CopyParty)	File CDN storage
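
To check which of the assigned ports are currently reachable, a small TCP probe can help. This is a minimal sketch, assuming the services listen on the local host; the names PIPELINE_PORTS, port_is_open, and pipeline_status are hypothetical and not part of any Nexus tool. Because PDF-Converter is a stateless MCP server with no persistent port, 6652 may legitimately show as closed.

```python
import socket

# Port assignments copied from the table above.
PIPELINE_PORTS = {
    6650: "Corpus Redis Vault",
    6651: "Corpus Redis Operational",
    6652: "PDF-Converter MCP",
    6653: "Nexus Docs (CopyParty)",
}


def port_is_open(port, host="127.0.0.1", timeout=0.5):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def pipeline_status(host="127.0.0.1", ports=PIPELINE_PORTS):
    """Map each component name to whether its port accepts connections."""
    return {name: port_is_open(port, host) for port, name in ports.items()}


if __name__ == "__main__":
    for name, is_up in pipeline_status().items():
        print(f"{name}: {'open' if is_up else 'closed'}")
```

A successful TCP connect only shows that something is listening; it does not confirm that the listener is the expected service.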

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    DOCUMENT PIPELINE CLUSTER                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────┐  │
│  │ Nexus Docs   │    │ PDF-Converter │    │     Corpus       │  │
│  │ (CopyParty)  │───▶│  MCP Server   │───▶│   Redis Store    │  │
│  │ Port: 6653   │    │ Port: 6652*   │    │ Ports: 6650/6651 │  │
│  └──────────────┘    └──────────────┘    └──────────────────┘  │
│        │                    │                     ▲             │
│        │                    │                     │             │
│        ▼                    ▼                     │             │
│  ┌──────────────┐    ┌──────────────┐            │             │
│  │ /data/cdn/   │    │ nlm-ingestor │────────────┘             │
│  │ users/{user} │    │ Port: 6752   │                          │
│  └──────────────┘    └──────────────┘                          │
│                                                                  │
│  * PDF-Converter is stateless MCP - no persistent port needed   │
└─────────────────────────────────────────────────────────────────┘

Data Flow

Extraction Flow (PDF → Data):

User File (CDN) → PDF-Converter → LLMSherpa → Structured Data → Corpus

Creation Flow (Data → PDF):

AI Content → PDF-Converter → Staging → User Approval → Nexus Docs (CDN)

2. PDF-Converter Server

Location: /opt/mcp-servers/pdf-converter/mcp_pdf_converter_server.py
Type: Stateless MCP (no Redis dependency)
Tools: 7

Tool Reference

convert

Extract PDF content and store in Corpus.

pdf-converter.convert(
    source: str,           # File path to PDF
    category: str,         # Category: book, manual, research, report, etc.
    title: str = None,     # Document title (auto-detected if omitted)
    engine: str = "llmsherpa",  # Extraction engine
    extract_images: bool = True,
    send_to_corpus: bool = True,
    chapter_max_level: int = 2,
    section_max_level: int = 4
)

Returns: Document ID, Corpus stable ID, extraction stats

analyze

Preview PDF structure without extracting.

pdf-converter.analyze(
    source: str    # File path to PDF
)

Returns: Section levels, sample content, recommended thresholds

batch

Process multiple PDFs from a folder.

pdf-converter.batch(
    folder: str,        # Folder containing PDFs
    category: str,      # Category for all documents
    pattern: str = "*.pdf",
    engine: str = "llmsherpa"
)
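
The loop behind a batch run can be pictured roughly as follows. This is a sketch only, not the server's implementation: batch_convert is a hypothetical helper, the convert argument stands in for whatever performs a single extraction, and the shape of the returned summary is an assumption, since the tool's return value is not documented here.

```python
from pathlib import Path


def batch_convert(folder, category, convert, pattern="*.pdf", engine="llmsherpa"):
    """Call convert(path, category, engine) for each matching file in folder.

    A failure on one file is recorded and does not stop the rest of the batch.
    """
    results = {"succeeded": [], "failed": []}
    for pdf in sorted(Path(folder).glob(pattern)):
        if not pdf.is_file():
            continue  # skip directories that happen to match the pattern
        try:
            outcome = convert(str(pdf), category, engine)
            results["succeeded"].append((str(pdf), outcome))
        except Exception as exc:  # keep going; report the error per file
            results["failed"].append((str(pdf), str(exc)))
    return results
```

Collecting failures instead of raising on the first one lets a single corrupt PDF be reported without discarding the rest of the folder.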

status

Check extraction engine availability.

pdf-converter.status()

Returns: Engine status (llmsherpa, pymupdf, marker)

find_user_file

Find files in user's CDN folder.

pdf-converter.find_user_file(
    username: str,     # Username (e.g., "chris")
    pattern: str,      # Glob pattern (e.g., "*.pdf")
    folder: str = None # Optional subfolder
)

Returns: Matching files with paths, sizes, modification times
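
To illustrate the kind of lookup this tool performs, here is a minimal sketch, assuming the CDN root is /data/cdn/users (from the Storage Paths table) and that matching is a plain, non-recursive glob. The function name find_files and the returned field names are hypothetical; the real server may differ.

```python
from datetime import datetime, timezone
from pathlib import Path

CDN_ROOT = Path("/data/cdn/users")  # from the Storage Paths table


def find_files(username, pattern, folder=None, root=CDN_ROOT):
    """Return path, size, and modification time for files matching a glob."""
    base = Path(root) / username
    if folder:
        base = base / folder
    if not base.is_dir():
        return []  # unknown user or subfolder: nothing to report
    matches = []
    for path in sorted(base.glob(pattern)):
        if not path.is_file():
            continue
        info = path.stat()
        modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
        matches.append({
            "path": str(path),
            "size_bytes": info.st_size,
            "modified": modified.isoformat(),
        })
    return matches
```

The sketch does not validate username or folder; a production version would need to reject values that escape the user's directory.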

create_pdf

Create PDF from markdown and stage for user approval.

pdf-converter.create_pdf(
    title: str,        # PDF title (REQUIRED)
    content: str,      # Markdown content (REQUIRED)
    user_id: str,      # User ID - e.g., "u_z1p5" (REQUIRED)
    category: str = "report"
)

Returns: Staging path, filename, next step instructions

move_to_docs

Move staged PDF to permanent Nexus Docs location.

pdf-converter.move_to_docs(
    staging_path: str,  # Path from create_pdf
    username: str,      # Target username (e.g., "chris")
    folder: str = "documents"
)

Returns: Final path and CDN URL
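
The move itself can be sketched as below. This is an illustration under stated assumptions, not the server's code: move_staged_pdf is a hypothetical name, the destination layout follows the Storage Paths table, and the URL pattern is inferred from the example URL in Workflow B.

```python
import shutil
from pathlib import Path
from urllib.parse import quote

CDN_ROOT = Path("/data/cdn/users")              # from the Storage Paths table
CDN_BASE_URL = "https://docs.corlera.com/home"  # inferred from the Workflow B example


def move_staged_pdf(staging_path, username, folder="documents",
                    cdn_root=CDN_ROOT, base_url=CDN_BASE_URL):
    """Move a staged PDF into the user's CDN folder; return (final_path, url)."""
    src = Path(staging_path)
    if not src.is_file():
        raise FileNotFoundError(f"staged file not found: {src}")
    dest_dir = Path(cdn_root) / username / folder
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    if dest.exists():
        # Assumption: never overwrite silently; the real tool may rename instead.
        raise FileExistsError(f"refusing to overwrite: {dest}")
    shutil.move(str(src), str(dest))
    url = f"{base_url}/{quote(username)}/{quote(folder)}/{quote(src.name)}"
    return str(dest), url
```

Note that staging is keyed by user_id (for example "u_z1p5") while the destination is keyed by username (for example "chris"), so the caller has to supply both.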

Extraction Engines

Engine	Status	Use Case
LLMSherpa	✅ Active	Hierarchical structure, tables, sections
PyMuPDF	✅ Active	Image extraction
Marker	⏳ Planned	Scientific docs with equations

Staging Architecture

All AI-generated PDFs go to staging first:

/data/staging/
└── pdf/
    └── {user_id}/
        └── {Title}_{timestamp}.pdf

Why staging?
- Multi-user isolation by user_id
- AI creates, user confirms destination
- Prevents scattered documents across locations
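
The staging layout can be expressed as a small path builder. This is a minimal sketch: staging_path is a hypothetical helper, and both the title sanitising rule and the date-only timestamp are assumptions inferred from the filename in the Workflow B example (Q4_Research_Report_20260107.pdf).

```python
import re
from datetime import datetime
from pathlib import Path

STAGING_ROOT = Path("/data/staging/pdf")


def staging_path(title, user_id, now=None, root=STAGING_ROOT):
    """Build {root}/{user_id}/{Title}_{timestamp}.pdf for a new staged PDF."""
    now = now or datetime.now()
    # Assumption: any run of characters other than letters, digits, or "-"
    # collapses to a single underscore.
    safe_title = re.sub(r"[^A-Za-z0-9-]+", "_", title).strip("_") or "untitled"
    # Assumption: date-only stamp, matching the Workflow B example filename.
    stamp = now.strftime("%Y%m%d")
    return Path(root) / user_id / f"{safe_title}_{stamp}.pdf"


print(staging_path("Q4 Research Report", "u_z1p5", datetime(2026, 1, 7)).as_posix())
```

With a date-only stamp, two PDFs with the same title created on the same day would collide, so a real implementation may well use a finer-grained timestamp.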

3. Corpus Integration

PDF-Converter writes extracted content directly to the Corpus Redis vault (port 6650).

Connection Details

CORPUS_VAULT_PORT = 6650
CORPUS_PASSWORD = "[from locker]"
CORPUS_PREFIX = "corp"

Key Format

corp:{user_id}:{raw_id}

Example: corp:u_z1p5:20260107_1106_9be89c3e
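
Building and parsing keys in this format might look like the sketch below. The helper names make_key, parse_key, and CorpusKey are hypothetical. The check that user IDs start with "u_" is an assumption drawn from the examples in this document; it is there because index keys such as corp:category:{category} also have three colon-separated parts.

```python
from typing import NamedTuple

CORPUS_PREFIX = "corp"


class CorpusKey(NamedTuple):
    user_id: str
    raw_id: str


def make_key(user_id, raw_id, prefix=CORPUS_PREFIX):
    """Return the document key {prefix}:{user_id}:{raw_id}."""
    if ":" in user_id or ":" in raw_id:
        raise ValueError("user_id and raw_id must not contain ':'")
    return f"{prefix}:{user_id}:{raw_id}"


def parse_key(key, prefix=CORPUS_PREFIX):
    """Split a document key into its user_id and raw_id parts."""
    parts = key.split(":")
    # Assumption: user IDs start with "u_" (as in "u_z1p5"), which keeps
    # three-part index keys from being mistaken for document keys.
    if len(parts) != 3 or parts[0] != prefix or not parts[1].startswith("u_"):
        raise ValueError(f"not a document key: {key!r}")
    return CorpusKey(parts[1], parts[2])
```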

Document Schema

{
    "id": "20260107_1106_9be89c3e",
    "stable_id": "c_9c3e",
    "title": "Document Title",
    "content": "Extracted text content...",
    "category": "research",
    "source_path": "/data/cdn/users/chris/documents/file.pdf",
    "extraction_engine": "llmsherpa",
    "tags": ["extracted", "engine:llmsherpa", "research"],
    "metadata": {
        "sections": 12,
        "tables": 3,
        "chunks": 45,
        "images": 8,
        "total_chars": 25000
    },
    "created_at": "2026-01-07T11:06:51",
    "user": "u_z1p5"
}
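
A record in this shape could be assembled as follows. This is a sketch under assumptions: build_document is a hypothetical helper, the tag order simply mirrors the example above, and how the record is serialised into Redis (JSON string, hash, or otherwise) is not specified in this document.

```python
from datetime import datetime

METADATA_FIELDS = ("sections", "tables", "chunks", "images", "total_chars")


def build_document(raw_id, stable_id, title, content, category,
                   source_path, engine, user_id, stats, created_at=None):
    """Assemble a Corpus document record matching the schema above."""
    created_at = created_at or datetime.now()
    return {
        "id": raw_id,
        "stable_id": stable_id,
        "title": title,
        "content": content,
        "category": category,
        "source_path": source_path,
        "extraction_engine": engine,
        # Tags as in the example: a fixed marker, the engine, and the category.
        "tags": ["extracted", f"engine:{engine}", category],
        # Missing counters default to 0 so every record has the same fields.
        "metadata": {name: int(stats.get(name, 0)) for name in METADATA_FIELDS},
        "created_at": created_at.isoformat(timespec="seconds"),
        "user": user_id,
    }
```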

Indexes Created

Index	Purpose
`corp:id_index`	stable_id → raw_id mapping
`corp:stable_index`	raw_id → stable_id mapping
`corp:category:{category}`	Documents by category
`corp:user:{user_id}:docs`	Documents by user
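
The index updates for one stored document can be listed as plain command tuples. This is a sketch only: index_operations is a hypothetical helper that sends nothing to Redis, and the choice of hashes (HSET) for the two ID mappings and sets (SADD) for the category and user indexes is an assumption, since the underlying Redis data types are not stated here.

```python
def index_operations(doc, prefix="corp"):
    """Return the index writes for one document as (command, key, *args) tuples."""
    raw_id = doc["id"]
    stable_id = doc["stable_id"]
    return [
        # stable_id -> raw_id mapping (assumed to be a hash)
        ("HSET", f"{prefix}:id_index", stable_id, raw_id),
        # raw_id -> stable_id mapping (assumed to be a hash)
        ("HSET", f"{prefix}:stable_index", raw_id, stable_id),
        # membership indexes (assumed to be sets of raw IDs)
        ("SADD", f"{prefix}:category:{doc['category']}", raw_id),
        ("SADD", f"{prefix}:user:{doc['user']}:docs", raw_id),
    ]


example = {
    "id": "20260107_1106_9be89c3e",
    "stable_id": "c_9c3e",
    "category": "research",
    "user": "u_z1p5",
}
for operation in index_operations(example):
    print(" ".join(operation))
```

Keeping the plan separate from execution makes it easy to review or run all four writes in one transaction.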

4. Example Workflows

Workflow A: Extract User's PDF

# 1. Find the file
result = pdf-converter.find_user_file(
    username="chris",
    pattern="*.pdf",
    folder="uploads"
)
# Returns: /data/cdn/users/chris/uploads/contract.pdf

# 2. Analyze structure (optional)
analysis = pdf-converter.analyze(
    source="/data/cdn/users/chris/uploads/contract.pdf"
)
# Returns: Recommended thresholds

# 3. Extract and store in Corpus
extracted = pdf-converter.convert(
    source="/data/cdn/users/chris/uploads/contract.pdf",
    category="contract",
    send_to_corpus=True
)
# Returns: corpus_id, stable_id, extraction stats

Workflow B: Create Research Report

# 1. Create content (AI generates this)
content = """
# Research Findings

## Summary
Analysis of Q4 sales data reveals...

## Key Metrics
- Revenue: $2.4M
- Growth: 15%
"""

# 2. Create PDF (goes to staging)
result = pdf-converter.create_pdf(
    title="Q4 Research Report",
    content=content,
    user_id="u_z1p5",
    category="report"
)
# Returns: staging_path = /data/staging/pdf/u_z1p5/Q4_Research_Report_20260107.pdf

# 3. Ask user where to put it
# AI: "I created your Q4 Research Report. Where would you like me to save it?"
# User: "Put it in my reports folder"

# 4. Move to permanent location
final = pdf-converter.move_to_docs(
    staging_path="/data/staging/pdf/u_z1p5/Q4_Research_Report_20260107.pdf",
    username="chris",
    folder="reports"
)
# Returns: URL = https://docs.corlera.com/home/chris/reports/Q4_Research_Report_20260107.pdf

Workflow C: Batch Process Uploads

# Process all new PDFs in uploads folder
pdf-converter.batch(
    folder="/data/cdn/users/chris/uploads",
    category="uncategorized",
    pattern="*.pdf"
)
# All PDFs extracted and stored in Corpus

5. Configuration Files

PDF-Converter

Main: /opt/mcp-servers/pdf-converter/mcp_pdf_converter_server.py
Backup: /opt/mcp-servers/pdf-converter/mcp_pdf_converter_server.py.bak

nlm-ingestor: Docker container on port 6752
Corpus: /opt/mcp-servers/corpus/mcp_corpus_server.py
CopyParty: Docker container nexus-cdn on port 6653

Storage Paths

Path	Purpose
`/data/cdn/users/{username}/`	User file storage (CDN)
`/data/staging/pdf/{user_id}/`	Temporary PDF staging
`/data/nexus3/documents/vault/`	Corpus Redis data

Documentation created by Agent Indiana (a_jh9b) | KB node by Rocky (o_jugt) | 2026-01-07