Content Pipeline

Overview

The content pipeline transforms raw HTML into clean, structured markdown for AI consumption and storage.

Technology Stack

Primary Libraries

  • markdownify - HTML to Markdown conversion (installed)
  • BeautifulSoup4 - HTML parsing and cleaning (installed)

Why These Libraries?

  • Both are already installed in the Nexus environment
  • markdownify handles complex HTML structures well
  • BeautifulSoup provides robust cleaning capabilities
  • Together they cover the cases listed under Edge Cases Tested, including malformed HTML

Pipeline Stages

Raw HTML → Clean HTML → Convert to Markdown → Structured Output

Stage 1: HTML Cleaning

  • Remove scripts and styles
  • Strip unwanted tags (ads, navigation)
  • Normalize whitespace
  • Handle malformed HTML
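
The cleaning steps above can be sketched with BeautifulSoup as follows. This is a minimal illustration, not the Nexus implementation: the helper name `clean_html`, the default tag list, and the choice of the `html.parser` backend are all assumptions. Ad removal usually needs site-specific rules, so it is left to the caller through `remove_tags`.

```python
import re
from typing import Iterable, Optional

from bs4 import BeautifulSoup, Comment, NavigableString

# Assumed defaults: the source names scripts, styles, and navigation; ads
# usually need site-specific rules, so they are left to the caller here.
DEFAULT_REMOVE_TAGS = ("script", "style", "nav")


def clean_html(html: str, remove_tags: Optional[Iterable[str]] = None) -> str:
    """Return a cleaned copy of `html` (hypothetical helper for illustration)."""
    # html.parser is lenient: unclosed tags are closed instead of raising.
    soup = BeautifulSoup(html, "html.parser")

    # extract() detaches each element; it is safe even when matches are nested.
    for tag in soup.find_all(list(remove_tags or DEFAULT_REMOVE_TAGS)):
        tag.extract()

    for node in soup.find_all(string=True):
        if isinstance(node, Comment):
            node.extract()  # drop HTML comments
        elif type(node) is NavigableString and node.find_parent("pre") is None:
            # Collapse whitespace runs, leaving <pre> content untouched.
            node.replace_with(re.sub(r"\s+", " ", node))

    return str(soup)
```

Calling `clean_html(raw_html, remove_tags=["script", "style", "nav", "aside"])` would extend the removal list for a page that keeps ads in `<aside>` elements.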

Stage 2: Markdown Conversion

  • Preserve semantic structure (headings, lists)
  • Convert tables to markdown format
  • Preserve code blocks as fenced blocks, keeping the language hint when one is available
  • Convert images to markdown syntax
  • Preserve links with proper formatting

Stage 3: Output Structuring

  • Extract title/heading
  • Identify main content
  • Capture metadata (author, date, etc.)
  • Format for AI consumption
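
One way to sketch the structuring stage is shown below. The function name, the returned keys, the fallback order for locating the main content (`<main>`, then `<article>`, then `<body>`), and the use of `<meta name="...">` tags for author and date are all assumptions; the source does not specify how these fields are located.

```python
from bs4 import BeautifulSoup


def structure_output(html: str) -> dict:
    """Pull title, main content, and metadata out of a page (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")

    # Prefer the first <h1>; fall back to the document <title>.
    heading = soup.find("h1") or soup.title
    title = heading.get_text(strip=True) if heading else None

    # Assumed heuristic for the main content container.
    main = soup.find("main") or soup.find("article") or soup.body or soup

    # Assumed metadata source: <meta name="author|date" content="...">.
    metadata = {}
    for name in ("author", "date"):
        meta = soup.find("meta", attrs={"name": name})
        if meta and meta.get("content"):
            metadata[name] = meta["content"]

    return {"title": title, "content_html": str(main), "metadata": metadata}
```

The `content_html` value would then be passed to the markdown conversion stage, so navigation and other page chrome outside the main container never reach the output.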

Tools

web.to_markdown

@mcp.tool()
async def to_markdown(html: str, preserve_links: bool = True) -> str:
    """Convert HTML content to clean markdown."""

Primary conversion tool using markdownify.

web.extract

from typing import List, Optional

@mcp.tool()
async def extract(url: str, selectors: Optional[List[str]] = None) -> dict:
    """Extract specific content using CSS selectors."""

Targeted extraction for specific page elements.
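
The selector step could be sketched as below, operating on HTML that has already been fetched. The helper name `extract_by_selectors` and the return shape (selector mapped to a list of text strings) are assumptions; the source only states that the tool takes a URL and CSS selectors and returns a dict.

```python
from typing import Dict, List

from bs4 import BeautifulSoup


def extract_by_selectors(html: str, selectors: List[str]) -> Dict[str, List[str]]:
    """Map each CSS selector to the text of its matches (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        # select() returns every element matching the CSS selector.
        selector: [el.get_text(" ", strip=True) for el in soup.select(selector)]
        for selector in selectors
    }
```

A selector that matches nothing maps to an empty list, so callers can tell "not found" apart from an error.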

web.clean

from typing import List, Optional

@mcp.tool()
async def clean(html: str, remove_tags: Optional[List[str]] = None) -> str:
    """Clean HTML by removing unwanted elements."""

Pre-processing step for complex pages.

Edge Cases Tested

| Case | Handling |
| --- | --- |
| Tables | Converted to markdown tables |
| Code blocks | Preserved with ``` syntax |
| Images | ![alt](url) format |
| Links | [text](url) format |
| Nested lists | Proper indentation |
| Special chars | HTML entity decoding |
| Malformed HTML | BeautifulSoup fixes |

Integration Points

  • Receives HTML from spider_rs scrape
  • Outputs to temp environment for staging
  • Final results stored in KB or appropriate destination
Updated: 2026-01-08T12:35:14