Content Pipeline

Overview

The content pipeline transforms raw HTML into clean, structured markdown for AI consumption and storage.

Raw HTML → Clean HTML → Convert to Markdown → Structured Output

@mcp.tool()
async def to_markdown(html: str, preserve_links: bool = True) -> str:
    """Convert HTML content to clean markdown."""

Primary conversion tool using markdownify.

@mcp.tool()
async def extract(url: str, selectors: Optional[List[str]] = None) -> dict:
    """Extract specific content using CSS selectors."""

Targeted extraction for specific page elements.
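
One plausible shape for selector-based extraction is sketched below using BeautifulSoup's `select`. It operates on an HTML string so that it runs without network access; fetching the URL is left out. The helper name `extract_from_html`, the fallback to `body` when no selectors are given, and the selector-to-text-list return shape are all assumptions for illustration.

```python
from typing import Dict, List, Optional

from bs4 import BeautifulSoup


def extract_from_html(
    html: str, selectors: Optional[List[str]] = None
) -> Dict[str, List[str]]:
    """Hypothetical helper: map each CSS selector to the text of its matches."""
    soup = BeautifulSoup(html, "html.parser")
    results: Dict[str, List[str]] = {}
    for selector in selectors or ["body"]:  # assumed default: the whole body
        matches = soup.select(selector)
        # Collapse each matched element to its visible text.
        results[selector] = [el.get_text(" ", strip=True) for el in matches]
    return results


page = """
<html><body>
  <h1 class="title">Release notes</h1>
  <div id="main"><p>First item.</p><p>Second item.</p></div>
</body></html>
"""
extracted = extract_from_html(page, ["h1.title", "#main p"])
print(extracted)
```

Keying the result by selector keeps it obvious which rule produced which text, which helps when a selector matches nothing and returns an empty list.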

@mcp.tool()
async def clean(html: str, remove_tags: Optional[List[str]] = None) -> str:
    """Clean HTML by removing unwanted elements."""

Pre-processing step for complex pages.
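
A minimal sketch of such a cleaning pass, assuming BeautifulSoup is used to drop elements by tag name. The helper name `clean_html` and the default tag list are hypothetical; the tool's actual defaults are not documented here.

```python
from typing import List, Optional

from bs4 import BeautifulSoup

# Assumed default; the real tool may remove a different set of tags.
DEFAULT_REMOVE_TAGS = ["script", "style", "nav", "footer"]


def clean_html(html: str, remove_tags: Optional[List[str]] = None) -> str:
    """Hypothetical helper: drop unwanted elements and their contents."""
    soup = BeautifulSoup(html, "html.parser")
    for element in soup.find_all(remove_tags or DEFAULT_REMOVE_TAGS):
        element.decompose()  # removes the tag together with its children
    return str(soup)


noisy = "<div><script>track()</script><p>Keep me</p><nav>Menu</nav></div>"
cleaned = clean_html(noisy)
print(cleaned)
```

Running the cleaner before conversion keeps script bodies and navigation text from leaking into the markdown output.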

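| Case | Handling |
|------|----------|
| Tables | Converted to markdown tables |
| Code blocks | Preserved as fenced (triple-backtick) code blocks |
| Images | `![alt](url)` format |
| Links | `[text](url)` format |
| Nested lists | Indented to match nesting depth |
| Special chars | HTML entities decoded |
| Malformed HTML | Repaired by BeautifulSoup during parsing |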
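
The special-character case can be illustrated with Python's standard library. This is a stand-in for demonstration only: the document does not say which component performs entity decoding in the pipeline.

```python
import html

# Named and numeric entities are both decoded by html.unescape.
raw = "Fish &amp; chips &lt;daily&gt; &copy; caf&eacute; &#39;special&#39;"
decoded = html.unescape(raw)
print(decoded)
```

Note that decoding `&lt;` and `&gt;` yields literal angle brackets, so a real pipeline would likely decode entities only after tags have been parsed.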