Deduplication Example

Remove duplicate content from RAG retrieval results.

Scenario

Your RAG system retrieved multiple similar chunks that contain overlapping information.

Example Code

from prompt_refiner import Deduplicate

# RAG results containing a near-duplicate (hyphen vs. space)
rag_results = """
Python is a high-level programming language.

Python is a high level programming language.

Python supports multiple programming paradigms.
"""

pipeline = Deduplicate(similarity_threshold=0.85)
deduplicated = pipeline.run(rag_results)

print(deduplicated)
# Output: Only unique paragraphs remain
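To make the mechanism concrete, here is a minimal sketch of threshold-based deduplication in plain Python, using word-set Jaccard similarity. This is an illustration only, not `Deduplicate`'s actual implementation:

```python
import re

def jaccard(a: str, b: str) -> float:
    """Word-set Jaccard similarity between two text chunks."""
    wa = set(re.findall(r"[a-z0-9]+", a.lower()))
    wb = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)

def dedupe(chunks, threshold=0.85):
    """Keep a chunk only if it is below the threshold against every kept chunk."""
    kept = []
    for chunk in chunks:
        if all(jaccard(chunk, k) < threshold for k in kept):
            kept.append(chunk)
    return kept

paragraphs = [
    "Python is a high-level programming language.",
    "Python is a high level programming language.",  # near-duplicate
    "Python supports multiple programming paradigms.",
]
print(dedupe(paragraphs, threshold=0.85))
# The near-duplicate is dropped; two unique paragraphs remain.
```

After tokenization the first two paragraphs share an identical word set, so their similarity reaches 1.0 and the second is discarded.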

Adjusting Sensitivity

# More aggressive: chunks at 70%+ similarity are treated as duplicates
pipeline = Deduplicate(similarity_threshold=0.70)

# Sentence-level deduplication
pipeline = Deduplicate(granularity="sentence")

Performance Considerations

When working with large RAG contexts, keep these performance tips in mind:

Choosing a Similarity Method

# Fast: Jaccard (word-based) - recommended for most use cases
pipeline = Deduplicate(method="jaccard")

# Precise but slower: Levenshtein (character-based)
# Only use when you need character-level accuracy
pipeline = Deduplicate(method="levenshtein")
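To see why Levenshtein is the slower option: it fills a table of size proportional to the product of the two chunk lengths, where Jaccard only needs the word sets of each chunk. A standalone sketch of the classic dynamic-programming edit distance (not the library's code):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

a = "Python is a high-level programming language."
b = "Python is a high level programming language."
print(levenshtein(a, b))  # 1: a single hyphen-for-space substitution
```

Character-level distance catches this one-character difference precisely, but the quadratic cost per pair adds up quickly across many chunk comparisons.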

Scaling with Input Size

The deduplication algorithm compares each chunk against all previous chunks (O(n²)):

  • 10-50 chunks: Fast with either method (typical RAG use case)
  • 50-200 chunks: Use Jaccard for better performance
  • 200+ chunks: Use granularity="paragraph" to reduce chunk count

# For large documents: use paragraph granularity
pipeline = Deduplicate(
    similarity_threshold=0.85,
    method="jaccard",
    granularity="paragraph"  # Fewer chunks = faster
)
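The reason paragraph granularity helps follows directly from the O(n²) pairwise comparison: fewer chunks means quadratically fewer comparisons. A quick standalone illustration (hypothetical splitting rules, not the library's tokenizer):

```python
import re

text = "First paragraph. It has two sentences.\n\nSecond paragraph here."

# Paragraph granularity: split on blank lines
paragraphs = [p for p in text.split("\n\n") if p.strip()]
# Sentence granularity: split after sentence-ending punctuation
sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def pair_count(n):
    """Number of pairwise similarity comparisons for n chunks."""
    return n * (n - 1) // 2

print(len(paragraphs), pair_count(len(paragraphs)))  # 2 chunks -> 1 comparison
print(len(sentences), pair_count(len(sentences)))    # 3 chunks -> 3 comparisons
```

Even on this tiny input the sentence-level pass does three times the comparisons; on a 200-chunk context the gap is far larger.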

Full Example

See: examples/compressor/deduplication.py