Skip to content

HTML Cleaning Example

Clean HTML content from web scraping or user input.

Scenario

You've scraped content from a website and need to clean it before sending to an LLM API.

Example Code

from prompt_refiner import StripHTML, NormalizeWhitespace

html_content = """
<div class="article">
    <h1>Understanding <strong>LLMs</strong></h1>
    <p>Large Language Models are powerful <em>AI systems</em>.</p>
</div>
"""

# Remove all HTML and normalize whitespace
pipeline = (
    StripHTML()
    | NormalizeWhitespace()
)

cleaned = pipeline.run(html_content)
print(cleaned)
# Output: "Understanding LLMs Large Language Models are powerful AI systems."

Converting to Markdown

# Convert HTML to Markdown instead of removing
pipeline = (
    StripHTML(to_markdown=True)
    | NormalizeWhitespace()
)

markdown = pipeline.run(html_content)
print(markdown)
# Output:
# # Understanding **LLMs**
#
# Large Language Models are powerful *AI systems*.

Full Example

See the complete example: examples/cleaner/html_cleaning.py

python examples/cleaner/html_cleaning.py