Cleaner Module
The Cleaner module provides operations for cleaning dirty data, including HTML removal, whitespace normalization, Unicode fixing, and JSON compression.
StripHTML
Remove HTML tags from text, with options to preserve semantic tags or convert to Markdown.
prompt_refiner.cleaner.StripHTML
Bases: Refiner
Initialize the HTML stripper.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `preserve_tags` | `Optional[Set[str]]` | Set of tag names to preserve (e.g., `{'p', 'li', 'table'}`) | `None` |
| `to_markdown` | `bool` | Convert common HTML tags to Markdown syntax | `False` |
Source code in src/prompt_refiner/cleaner/html.py
Functions
process
Remove HTML tags from the input text.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | The input text containing HTML | required |

Returns:
| Type | Description |
|---|---|
| `str` | Text with HTML tags removed or converted to Markdown |
Examples
from prompt_refiner import StripHTML
# Basic HTML stripping
stripper = StripHTML()
result = stripper.process("<p>Hello <b>World</b>!</p>")
# Output: "Hello World!"
# Convert to Markdown
stripper = StripHTML(to_markdown=True)
result = stripper.process("<p>Hello <b>World</b>!</p>")
# Output: "Hello **World**!\n\n"
# Preserve specific tags
stripper = StripHTML(preserve_tags={"p", "div"})
result = stripper.process("<div>Keep <b>Remove</b></div>")
# Output: "<div>Keep Remove</div>"
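To make the tag-preservation behaviour concrete, here is a minimal standard-library sketch. It is not the library's implementation: `TagStripper` and `strip_html` are hypothetical names, attributes on preserved tags are dropped for brevity, and Markdown conversion is not covered.

```python
from html.parser import HTMLParser
from typing import List, Optional, Set


class TagStripper(HTMLParser):
    """Hypothetical helper: keeps text, re-emits only the preserved tags."""

    def __init__(self, preserve_tags: Optional[Set[str]] = None) -> None:
        super().__init__(convert_charrefs=True)
        self.preserve_tags = preserve_tags or set()
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Assumption: attributes are discarded, only the bare tag is kept.
        if tag in self.preserve_tags:
            self.parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in self.preserve_tags:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text content is always kept.
        self.parts.append(data)


def strip_html(text: str, preserve_tags: Optional[Set[str]] = None) -> str:
    parser = TagStripper(preserve_tags)
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(strip_html("<p>Hello <b>World</b>!</p>"))                   # Hello World!
print(strip_html("<div>Keep <b>Remove</b></div>", {"p", "div"}))  # <div>Keep Remove</div>
```

Using a real parser rather than a regular expression keeps the sketch robust to attributes and character references, at the cost of a few more lines.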
NormalizeWhitespace
Collapse excessive whitespace, tabs, and newlines into single spaces.
prompt_refiner.cleaner.NormalizeWhitespace
Bases: Refiner
Normalize whitespace in text.
Functions
process
Normalize whitespace by collapsing multiple spaces into one.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | The input text | required |

Returns:
| Type | Description |
|---|---|
| `str` | Text with normalized whitespace |
Source code in src/prompt_refiner/cleaner/whitespace.py
Examples
from prompt_refiner import NormalizeWhitespace
normalizer = NormalizeWhitespace()
result = normalizer.process("Hello World \t\n Foo")
# Output: "Hello World Foo"
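The collapsing step can be approximated with a single regular expression. This is a sketch under assumptions, not the library's code: `normalize_whitespace` is a hypothetical name, and trimming leading and trailing whitespace is an assumption the documentation above does not confirm.

```python
import re


def normalize_whitespace(text: str) -> str:
    # \s+ matches any run of spaces, tabs, and newlines.
    # Assumption: leading/trailing whitespace is trimmed as well.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_whitespace("Hello    World  \t\n  Foo"))  # Hello World Foo
```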
FixUnicode
Remove problematic Unicode characters including zero-width spaces and control characters.
prompt_refiner.cleaner.FixUnicode
Bases: Refiner
Remove or fix problematic Unicode characters.
Initialize the Unicode fixer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `remove_zero_width` | `bool` | Remove zero-width spaces and similar characters | `True` |
| `remove_control_chars` | `bool` | Remove control characters (except newlines and tabs) | `True` |
Source code in src/prompt_refiner/cleaner/unicode.py
Functions
process
Clean problematic Unicode characters from text.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | The input text | required |

Returns:
| Type | Description |
|---|---|
| `str` | Text with problematic Unicode characters removed |
Examples
from prompt_refiner import FixUnicode
# Remove zero-width spaces and control chars
fixer = FixUnicode()
result = fixer.process("Hello\u200bWorld\u0000")
# Output: "HelloWorld"
# Only remove zero-width spaces
fixer = FixUnicode(remove_control_chars=False)
result = fixer.process("Hello\u200bWorld")
# Output: "HelloWorld"
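The two flags can be pictured as two independent filters over the characters of the string. The sketch below uses only `unicodedata`; `fix_unicode` and the `ZERO_WIDTH` set are hypothetical, and the exact set of characters the library treats as zero-width, as well as keeping `\r` alongside `\n` and `\t`, are assumptions.

```python
import unicodedata

# Assumption: a plausible set of zero-width characters, not the library's list.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}
KEPT_CONTROLS = {"\n", "\r", "\t"}


def fix_unicode(
    text: str,
    remove_zero_width: bool = True,
    remove_control_chars: bool = True,
) -> str:
    kept = []
    for ch in text:
        if remove_zero_width and ch in ZERO_WIDTH:
            continue
        # Category "Cc" covers the C0/C1 control characters.
        is_control = unicodedata.category(ch) == "Cc" and ch not in KEPT_CONTROLS
        if remove_control_chars and is_control:
            continue
        kept.append(ch)
    return "".join(kept)


print(fix_unicode("Hello\u200bWorld\u0000"))  # HelloWorld
```

With `remove_control_chars=False`, the same input would keep its trailing `\u0000` in this sketch, since only the zero-width filter runs.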
JsonCleaner
Clean and minify JSON by removing null values and empty containers.
prompt_refiner.cleaner.JsonCleaner
Bases: Refiner
Cleans and minifies JSON strings. Removes null values, empty containers, and extra whitespace.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `strip_nulls` | `bool` | If True, remove null/None values from objects and arrays (default: True) | `True` |
| `strip_empty` | `bool` | If True, remove empty dicts, lists, and strings (default: True) | `True` |
Example
from prompt_refiner import JsonCleaner
cleaner = JsonCleaner(strip_nulls=True, strip_empty=True)
dirty_json = '''
{
  "name": "Alice",
  "age": null,
  "address": {},
  "tags": [],
  "bio": ""
}
'''
result = cleaner.run(dirty_json)
print(result)
Use Cases
- RAG Context Compression: Strip nulls/empties from API responses before feeding to LLM
- Cost Optimization: Reduce token count by removing unnecessary JSON structure
- Data Cleaning: Normalize JSON from multiple sources with inconsistent null handling
Initialize JSON cleaner.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `strip_nulls` | `bool` | Remove null/None values | `True` |
| `strip_empty` | `bool` | Remove empty containers (dict, list, str) | `True` |
Source code in src/prompt_refiner/cleaner/json.py
Functions
process
Process the input JSON (string or object). Returns a minified JSON string.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `Union[str, Dict, List]` | JSON string, dict, or list to clean | required |

Returns:
| Type | Description |
|---|---|
| `str` | Minified JSON string with nulls/empties removed |
Note
If input is not valid JSON, returns input unchanged.
Examples
from prompt_refiner import JsonCleaner
# Strip nulls and empty containers
cleaner = JsonCleaner(strip_nulls=True, strip_empty=True)
dirty_json = """
{
  "name": "Alice",
  "age": null,
  "address": {},
  "tags": [],
  "bio": ""
}
"""
result = cleaner.process(dirty_json)
# Output: {"name":"Alice"}
# Only strip nulls, keep empties
cleaner = JsonCleaner(strip_nulls=True, strip_empty=False)
result = cleaner.process(dirty_json)
# Output: {"name":"Alice","address":{},"tags":[],"bio":""}
# Only minify (no cleaning)
cleaner = JsonCleaner(strip_nulls=False, strip_empty=False)
result = cleaner.process(dirty_json)
# Output: {"name":"Alice","age":null,"address":{},"tags":[],"bio":""}
# Works with dict/list inputs too
cleaner = JsonCleaner(strip_nulls=True, strip_empty=True)
data = {"name": "Bob", "tags": [], "age": None}
result = cleaner.process(data)
# Output: {"name":"Bob"}
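The behaviour shown in these examples can be reproduced with a short recursive walk plus `json.dumps` with compact separators. This is a sketch, not the library's implementation: `clean_json`, `_clean`, and `_removable` are hypothetical names, and removing containers that become empty only after their children are cleaned is an assumption.

```python
import json
from typing import Any, Dict, List, Union


def _removable(value: Any, strip_nulls: bool, strip_empty: bool) -> bool:
    if value is None:
        return strip_nulls
    if isinstance(value, (dict, list, str)) and len(value) == 0:
        return strip_empty
    return False


def _clean(value: Any, strip_nulls: bool, strip_empty: bool) -> Any:
    # Children are cleaned first, so a container emptied by cleaning
    # can itself be dropped by its parent (an assumption of this sketch).
    if isinstance(value, dict):
        items = ((k, _clean(v, strip_nulls, strip_empty)) for k, v in value.items())
        return {k: v for k, v in items if not _removable(v, strip_nulls, strip_empty)}
    if isinstance(value, list):
        items = (_clean(v, strip_nulls, strip_empty) for v in value)
        return [v for v in items if not _removable(v, strip_nulls, strip_empty)]
    return value


def clean_json(
    data: Union[str, Dict, List],
    strip_nulls: bool = True,
    strip_empty: bool = True,
) -> str:
    if isinstance(data, str):
        try:
            data = json.loads(data)
        except json.JSONDecodeError:
            return data  # invalid JSON is returned unchanged
    cleaned = _clean(data, strip_nulls, strip_empty)
    # Compact separators drop the spaces after "," and ":".
    return json.dumps(cleaned, separators=(",", ":"))


dirty = '{"name": "Alice", "age": null, "address": {}, "tags": [], "bio": ""}'
print(clean_json(dirty))                     # {"name":"Alice"}
print(clean_json(dirty, strip_empty=False))  # {"name":"Alice","address":{},"tags":[],"bio":""}
print(clean_json("not json"))                # not json
```

The root value is never removed in this sketch, so an input that cleans down to nothing still yields `{}` or `[]` rather than an empty string.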
Common Use Cases
Web Scraping
from prompt_refiner import Refiner, StripHTML, NormalizeWhitespace, FixUnicode
web_cleaner = (
    Refiner()
    .pipe(StripHTML(to_markdown=True))
    .pipe(FixUnicode())
    .pipe(NormalizeWhitespace())
)
Text Normalization
from prompt_refiner import Refiner, NormalizeWhitespace, FixUnicode
text_normalizer = (
    Refiner()
    .pipe(FixUnicode())
    .pipe(NormalizeWhitespace())
)
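Both pipelines above place whitespace normalization last. One plausible reason is that earlier steps can leave stray whitespace behind. The sketch below illustrates that ordering with plain functions and no dependency on the library; `compose` and the three step functions are hypothetical stand-ins, and the regex-based tag stripper is deliberately crude.

```python
import re
from functools import reduce
from typing import Callable

Step = Callable[[str], str]


def compose(*steps: Step) -> Step:
    """Return a function that applies each step in order, left to right."""
    return lambda text: reduce(lambda acc, step: step(acc), steps, text)


def strip_tags(text: str) -> str:
    # Crude stand-in: replace each tag with a space.
    return re.sub(r"<[^>]+>", " ", text)


def drop_zero_width(text: str) -> str:
    return text.replace("\u200b", "")


def collapse_whitespace(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()


web_cleaner = compose(strip_tags, drop_zero_width, collapse_whitespace)
print(web_cleaner("<p>Hello\u200b   <b>World</b></p>\n"))  # Hello World
```

Running `collapse_whitespace` first instead would leave the spaces introduced by `strip_tags` in the final output.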