Compressor Module

The Compressor module provides operations for reducing text size through smart truncation and deduplication.

TruncateTokens

Truncate text to a maximum number of tokens with intelligent sentence boundary detection.

prompt_refiner.compressor.TruncateTokens

TruncateTokens(
    max_tokens,
    strategy="head",
    respect_sentence_boundary=True,
)

Bases: Refiner

Truncate text to a maximum number of tokens with intelligent sentence boundary detection.

Initialize the truncation operation.

Parameters:

  • max_tokens (int, required): Maximum number of tokens to keep
  • strategy (Literal['head', 'tail', 'middle_out'], default 'head'): Truncation strategy
      - "head": Keep the beginning of the text
      - "tail": Keep the end of the text (useful for conversation history)
      - "middle_out": Keep beginning and end, remove middle
  • respect_sentence_boundary (bool, default True): If True, truncate at sentence boundaries

Source code in src/prompt_refiner/compressor/truncate.py
def __init__(
    self,
    max_tokens: int,
    strategy: Literal["head", "tail", "middle_out"] = "head",
    respect_sentence_boundary: bool = True,
):
    """
    Initialize the truncation operation.

    Args:
        max_tokens: Maximum number of tokens to keep
        strategy: Truncation strategy:
            - "head": Keep the beginning of the text
            - "tail": Keep the end of the text (useful for conversation history)
            - "middle_out": Keep beginning and end, remove middle
        respect_sentence_boundary: If True, truncate at sentence boundaries
    """
    self.max_tokens = max_tokens
    self.strategy = strategy
    self.respect_sentence_boundary = respect_sentence_boundary

Functions

process
process(text)

Truncate text to max_tokens.

Parameters:

  • text (str, required): The input text

Returns:

  str: Truncated text respecting sentence boundaries if configured

Source code in src/prompt_refiner/compressor/truncate.py
def process(self, text: str) -> str:
    """
    Truncate text to max_tokens.

    Args:
        text: The input text

    Returns:
        Truncated text respecting sentence boundaries if configured
    """
    estimated_tokens = self._estimate_tokens(text)

    if estimated_tokens <= self.max_tokens:
        return text

    if self.respect_sentence_boundary:
        sentences = self._split_sentences(text)

        if self.strategy == "head":
            return self._truncate_head_sentences(sentences)
        elif self.strategy == "tail":
            return self._truncate_tail_sentences(sentences)
        elif self.strategy == "middle_out":
            return self._truncate_middle_out_sentences(sentences)
    else:
        # Fallback to word-based truncation
        words = text.split()

        if self.strategy == "head":
            return " ".join(words[: self.max_tokens])
        elif self.strategy == "tail":
            return " ".join(words[-self.max_tokens :])
        elif self.strategy == "middle_out":
            half = self.max_tokens // 2
            start_words = words[:half]
            end_words = words[-(self.max_tokens - half) :]
            return " ".join(start_words) + " ... " + " ".join(end_words)

    return text

Truncation Strategies

  • head: Keep the beginning of the text (default)
  • tail: Keep the end of the text (useful for conversation history)
  • middle_out: Keep beginning and end, remove middle
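
To make the strategies concrete, here is an illustrative sketch of the word-based fallback path (respect_sentence_boundary=False), based on the fallback branch shown in the source above. The expected outputs assume the text is estimated to exceed max_tokens; with the default sentence-boundary mode, results depend on sentence splitting instead.

from prompt_refiner import TruncateTokens

# Illustrative input: 20 numbered "words"
words = " ".join(str(i) for i in range(20))

head = TruncateTokens(max_tokens=6, strategy="head", respect_sentence_boundary=False)
tail = TruncateTokens(max_tokens=6, strategy="tail", respect_sentence_boundary=False)
middle = TruncateTokens(max_tokens=6, strategy="middle_out", respect_sentence_boundary=False)

# Per the word-based fallback shown in the source above (assuming the
# internal token estimate exceeds max_tokens for this input):
print(head.process(words))    # "0 1 2 3 4 5"
print(tail.process(words))    # "14 15 16 17 18 19"
print(middle.process(words))  # "0 1 2 ... 17 18 19"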

Examples

from prompt_refiner import TruncateTokens

# Keep first 100 tokens
truncator = TruncateTokens(max_tokens=100, strategy="head")
result = truncator.process(long_text)

# Keep last 100 tokens
truncator = TruncateTokens(max_tokens=100, strategy="tail")
result = truncator.process(long_text)

# Keep first and last 50 tokens, remove middle
truncator = TruncateTokens(max_tokens=100, strategy="middle_out")
result = truncator.process(long_text)

# Truncate at word boundaries (faster, less precise)
truncator = TruncateTokens(
    max_tokens=100,
    strategy="head",
    respect_sentence_boundary=False
)
result = truncator.process(long_text)

Deduplicate

Remove duplicate or highly similar text chunks, useful for RAG contexts.

prompt_refiner.compressor.Deduplicate

Deduplicate(
    similarity_threshold=0.85,
    method="jaccard",
    granularity="paragraph",
)

Bases: Refiner

Remove duplicate or highly similar text chunks (useful for RAG contexts).

Performance Characteristics

This operation uses an O(n²) comparison algorithm, where each chunk is compared against all previously seen chunks. The total complexity is O(n² × comparison_cost), where comparison_cost depends on the selected similarity method:

  • Jaccard: O(m), where m is the chunk length (word-based)
  • Levenshtein: O(m₁ × m₂), where m₁ and m₂ are the chunk lengths (character-based)

For typical RAG contexts (10-50 chunks), performance is acceptable with either method. For larger inputs (200+ chunks), consider using paragraph granularity to reduce the number of comparisons, or use the Jaccard method for better performance.
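
For intuition about these per-comparison costs, here is a minimal, self-contained sketch of the two similarity notions. This is not the library's internal implementation; it only illustrates why word-level Jaccard scales linearly with chunk length while the Levenshtein dynamic program fills an m₁ × m₂ table.

# Illustrative sketch only -- not prompt_refiner's internal code.

def jaccard_similarity(a: str, b: str) -> float:
    """Word-level Jaccard: O(m) per comparison (m = words per chunk)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def levenshtein_similarity(a: str, b: str) -> float:
    """Character-level similarity from edit distance: O(m1 * m2) per comparison."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete from a
                            curr[j - 1] + 1,             # insert from b
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))

a = "The cache stores recent results."
b = "The cache stores the most recent results."
print(round(jaccard_similarity(a, b), 2))      # 0.83
print(round(levenshtein_similarity(a, b), 2))  # 0.78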

Initialize the deduplication operation.

Parameters:

  • similarity_threshold (float, default 0.85): Threshold for considering text similar (0.0-1.0)
  • method (Literal['levenshtein', 'jaccard'], default 'jaccard'): Similarity calculation method
      - "jaccard": Jaccard similarity (word-based, faster)
          * Complexity: O(m) per comparison, where m is the chunk length
          * Recommended for most use cases (10-200 chunks)
          * Fast even with long chunks
      - "levenshtein": Levenshtein distance (character-based)
          * Complexity: O(m₁ × m₂) per comparison
          * More precise but computationally expensive
          * Can be slow with long chunks (1000+ characters)
  • granularity (Literal['sentence', 'paragraph'], default 'paragraph'): Text granularity to deduplicate at
      - "sentence": Deduplicate at sentence level
          * More comparisons (more chunks) but smaller chunk sizes
          * Better for fine-grained deduplication
      - "paragraph": Deduplicate at paragraph level
          * Fewer comparisons but larger chunk sizes
          * Recommended for large documents to reduce n² scaling

Source code in src/prompt_refiner/compressor/deduplicate.py
def __init__(
    self,
    similarity_threshold: float = 0.85,
    method: Literal["levenshtein", "jaccard"] = "jaccard",
    granularity: Literal["sentence", "paragraph"] = "paragraph",
):
    """
    Initialize the deduplication operation.

    Args:
        similarity_threshold: Threshold for considering text similar (0.0-1.0)
        method: Similarity calculation method
            - "jaccard": Jaccard similarity (word-based, faster)
                * Complexity: O(m) per comparison where m is chunk length
                * Recommended for most use cases (10-200 chunks)
                * Fast even with long chunks
            - "levenshtein": Levenshtein distance (character-based)
                * Complexity: O(m₁ × m₂) per comparison
                * More precise but computationally expensive
                * Can be slow with long chunks (1000+ characters)
        granularity: Text granularity to deduplicate at
            - "sentence": Deduplicate at sentence level
                * More comparisons (more chunks) but smaller chunk sizes
                * Better for fine-grained deduplication
            - "paragraph": Deduplicate at paragraph level
                * Fewer comparisons but larger chunk sizes
                * Recommended for large documents to reduce n² scaling
    """
    self.similarity_threshold = similarity_threshold
    self.method = method
    self.granularity = granularity

Functions

process
process(text)

Remove duplicate or similar text chunks.

Parameters:

  • text (str, required): The input text

Returns:

  str: Text with duplicates removed

Performance Note

This method uses O(n²) comparisons where n is the number of chunks. For large inputs (200+ chunks), consider using paragraph granularity to reduce the number of chunks, or ensure you're using the jaccard method for better performance.

Source code in src/prompt_refiner/compressor/deduplicate.py
def process(self, text: str) -> str:
    """
    Remove duplicate or similar text chunks.

    Args:
        text: The input text

    Returns:
        Text with duplicates removed

    Performance Note:
        This method uses O(n²) comparisons where n is the number of chunks.
        For large inputs (200+ chunks), consider using paragraph granularity
        to reduce the number of chunks, or ensure you're using the jaccard
        method for better performance.
    """
    chunks = self._split_text(text)

    if not chunks:
        return text

    # Keep track of unique chunks
    unique_chunks = []
    seen_chunks = []

    for chunk in chunks:
        is_duplicate = False

        # Check similarity with all previously seen chunks
        for seen_chunk in seen_chunks:
            similarity = self._calculate_similarity(chunk, seen_chunk)
            if similarity >= self.similarity_threshold:
                is_duplicate = True
                break

        if not is_duplicate:
            unique_chunks.append(chunk)
            seen_chunks.append(chunk)

    # Reconstruct text
    if self.granularity == "paragraph":
        return "\n\n".join(unique_chunks)
    else:  # sentence
        return " ".join(unique_chunks)

Similarity Methods

  • jaccard: Jaccard similarity (word-based, faster) - default
  • levenshtein: Levenshtein distance (character-based, more accurate)

Granularity Levels

  • paragraph: Deduplicate at paragraph level (split by \n\n) - default
  • sentence: Deduplicate at sentence level (split by ., !, ?)
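
Granularity mainly changes how many chunks feed the O(n²) comparison loop. The sketch below only approximates the splitting rules described above (the library's actual splitter may differ in detail):

import re

doc = (
    "Caching speeds up retrieval. Results are stored per query.\n\n"
    "Use a vector index for search.\n\n"
    "Caching speeds up retrieval. Results are stored per query."
)

# Paragraph granularity: split on blank lines -> 3 chunks, at most 3 pairwise checks
paragraphs = [p for p in doc.split("\n\n") if p.strip()]

# Sentence granularity: naive split on ., !, ? -> 5 chunks, up to 10 pairwise checks
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", doc.replace("\n\n", " ")) if s.strip()]

print(len(paragraphs), len(sentences))  # 3 5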

Examples

from prompt_refiner import Deduplicate

# Basic deduplication (85% similarity threshold)
deduper = Deduplicate(similarity_threshold=0.85)
result = deduper.process(text_with_duplicates)

# More aggressive (70% similarity)
deduper = Deduplicate(similarity_threshold=0.70)
result = deduper.process(text_with_duplicates)

# Character-level similarity
deduper = Deduplicate(
    similarity_threshold=0.85,
    method="levenshtein"
)
result = deduper.process(text_with_duplicates)

# Sentence-level deduplication
deduper = Deduplicate(
    similarity_threshold=0.85,
    granularity="sentence"
)
result = deduper.process(text_with_duplicates)

Common Use Cases

RAG Context Optimization

from prompt_refiner import Refiner, Deduplicate, TruncateTokens

rag_optimizer = (
    Refiner()
    .pipe(Deduplicate(similarity_threshold=0.85))  # Remove duplicates first
    .pipe(TruncateTokens(max_tokens=2000))        # Then fit in context window
)
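
Applying the pipeline is then a single call. This usage is a sketch: it assumes the composed Refiner exposes the same process() method as the individual operations documented above, and retrieved_chunks is a placeholder for strings returned by your retriever. The same pattern applies to the pipelines below.

# Hypothetical usage -- assumes the composed pipeline exposes process()
# like the individual refiners documented above.
retrieved_context = "\n\n".join(retrieved_chunks)  # retrieved_chunks: list[str] (placeholder)
optimized_context = rag_optimizer.process(retrieved_context)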

Conversation History Compression

from prompt_refiner import Refiner, Deduplicate, TruncateTokens

conversation_compressor = (
    Refiner()
    .pipe(Deduplicate(granularity="sentence"))
    .pipe(TruncateTokens(max_tokens=1000, strategy="tail"))  # Keep recent messages
)

Document Summarization Prep

from prompt_refiner import Refiner, Deduplicate, TruncateTokens

summarization_prep = (
    Refiner()
    .pipe(Deduplicate(similarity_threshold=0.90))  # Remove near-duplicates
    .pipe(TruncateTokens(max_tokens=4000, strategy="middle_out"))  # Keep intro + conclusion
)