Content Pipeline
Overview
The content pipeline transforms raw HTML into clean, structured markdown for AI consumption and storage.
Technology Stack
Primary Libraries
- markdownify - HTML to Markdown conversion (installed)
- BeautifulSoup4 - HTML parsing and cleaning (installed)
Why These Libraries?
- Both are already installed in the Nexus environment
- markdownify handles complex HTML structures well
- BeautifulSoup provides robust cleaning capabilities
- Together they handle the cases listed under Edge Cases Tested, including malformed HTML
Pipeline Stages
Raw HTML → Clean HTML → Convert to Markdown → Structured Output
Stage 1: HTML Cleaning
- Remove scripts and styles
- Strip unwanted tags (ads, navigation)
- Normalize whitespace
- Handle malformed HTML
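The cleaning steps above could look roughly like the following. This is a minimal sketch, not the pipeline's actual code: the function name `clean_html`, the `DEFAULT_REMOVE_TAGS` list, and the choice to leave `<pre>` content untouched are all assumptions made for illustration.

```python
import re

from bs4 import BeautifulSoup, Comment, NavigableString

# Hypothetical default; the real list of unwanted tags is not given in this document.
DEFAULT_REMOVE_TAGS = ["script", "style", "noscript", "nav", "aside", "iframe"]


def clean_html(html, remove_tags=None):
    """Return HTML with unwanted tags and comments removed and whitespace collapsed."""
    # html.parser is lenient: unclosed tags are closed when the tree is serialized.
    soup = BeautifulSoup(html, "html.parser")

    # Detach unwanted elements together with everything inside them.
    for tag in soup.find_all(remove_tags or DEFAULT_REMOVE_TAGS):
        tag.extract()

    # Drop HTML comments, which carry no content for the markdown stage.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()

    # Collapse whitespace runs in plain text nodes, leaving <pre> blocks as they are.
    for node in soup.find_all(string=True):
        if type(node) is NavigableString and node.find_parent("pre") is None:
            collapsed = re.sub(r"\s+", " ", node)
            if collapsed != node:
                node.replace_with(collapsed)

    return str(soup)
```

Ads are usually marked by class or id rather than by tag name, so a real implementation would likely also need selector-based removal; that part is left out here.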
Stage 2: Markdown Conversion
- Preserve semantic structure (headings, lists)
- Convert tables to markdown format
- Preserve code blocks as fenced (```) blocks so downstream renderers can apply syntax highlighting
- Convert images to markdown syntax
- Preserve links with proper formatting
Stage 3: Output Structuring
- Extract title/heading
- Identify main content
- Capture metadata (author, date, etc.)
- Format for AI consumption
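One possible shape for the structured output is sketched below. The function name, the dictionary keys, and the metadata sources (`<meta name="author">` and the Open Graph style `article:published_time` property) are assumptions for illustration; the document does not specify them.

```python
from bs4 import BeautifulSoup


def _meta(soup, **attrs):
    """Return the content of the first <meta> tag matching attrs, or None."""
    tag = soup.find("meta", attrs=attrs)
    return tag.get("content") if tag else None


def structure_output(html):
    """Pull title, metadata, and the main content region out of a page."""
    soup = BeautifulSoup(html, "html.parser")

    # Title: prefer <title>, fall back to the first <h1>.
    title_tag = soup.title or soup.find("h1")
    title = title_tag.get_text(strip=True) if title_tag else None

    # Main content: <main>, then <article>, then <body>, else the whole document.
    main = soup.find("main") or soup.find("article") or soup.body or soup

    return {
        "title": title,
        "metadata": {
            "author": _meta(soup, name="author"),
            "date": _meta(soup, property="article:published_time"),
        },
        "main_html": str(main),  # would be handed to the markdown stage
    }
```

Returning a plain dictionary keeps the result easy to serialize as JSON for staging or for passing to a model.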
Tools
web.to_markdown
@mcp.tool()
async def to_markdown(html: str, preserve_links: bool = True) -> str:
    """Convert HTML content to clean markdown."""
Primary conversion tool using markdownify.
web.extract
@mcp.tool()
async def extract(url: str, selectors: Optional[List[str]] = None) -> dict:
    """Extract specific content using CSS selectors."""
Targeted extraction for specific page elements.
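The selector step of such a tool might be sketched as follows. Fetching the page by URL is left out, so this hypothetical helper `extract_from_html` takes HTML directly. The return shape (selector mapped to a list of matched text) and the `"text"` fallback key are assumptions.

```python
from typing import Dict, List, Optional

from bs4 import BeautifulSoup


def extract_from_html(html: str, selectors: Optional[List[str]] = None) -> Dict[str, List[str]]:
    """Map each CSS selector to the text of every element it matches."""
    soup = BeautifulSoup(html, "html.parser")
    if not selectors:
        # No selectors given: fall back to the page's full text under one key.
        return {"text": [soup.get_text(" ", strip=True)]}
    return {
        selector: [el.get_text(" ", strip=True) for el in soup.select(selector)]
        for selector in selectors
    }
```

A selector that matches nothing maps to an empty list, so callers can tell "not found" apart from "found but empty".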
web.clean
@mcp.tool()
async def clean(html: str, remove_tags: Optional[List[str]] = None) -> str:
    """Clean HTML by removing unwanted elements."""
Pre-processing step for complex pages.
Edge Cases Tested
| Case | Handling |
|---|---|
| Tables | Converted to markdown tables |
| Code blocks | Preserved with ``` syntax |
| Images | `![alt](url)` format |
| Links | [text](url) format |
| Nested lists | Proper indentation |
| Special chars | HTML entity decoding |
| Malformed HTML | BeautifulSoup fixes |
Integration Points
- Receives HTML from spider_rs scrape
- Writes output to the temp environment for staging
- Final results are stored in the KB or another appropriate destination