
Rule-Based vs. Neural Compression: When to Use Each

LLMLingua and similar neural approaches achieve higher compression, but at a cost. We explain the trade-offs and when each mode is the right choice.

ziptoken Engineering

ziptoken offers two compression modes: a fast, deterministic rule-based engine and an optional LLMLingua-powered neural mode. Choosing between them depends on your latency budget and compression goals.

Rule-based (default)

  • Latency: <5ms per call
  • Typical savings: 25–45%
  • Best for: High-throughput production workloads, customer-facing products, RAG pipelines

Neural (LLMLingua mode)

  • Latency: 100–400ms per call
  • Typical savings: 55–70%
  • Best for: Batch jobs, offline processing, maximising savings when latency is acceptable
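To make the distinction concrete, here is a toy rule-based pass in the spirit of the default mode. The rules below (whitespace collapsing, filler-word removal) are illustrative assumptions, not ziptoken's actual ruleset; the point is that pure string rules are deterministic and run in microseconds, which is why the mode fits inside a <5ms budget.

```python
import re

# Illustrative filler words; ziptoken's real rules are not shown here.
FILLERS = {"please", "kindly", "very", "really", "just", "basically"}

def rule_based_compress(prompt: str) -> str:
    # Collapse runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", prompt).strip()
    # Drop filler words that rarely change model behaviour.
    words = [w for w in text.split(" ") if w.lower().strip(".,") not in FILLERS]
    return " ".join(words)

def savings(before: str, after: str) -> float:
    # Rough proxy: character reduction as a fraction of the original.
    return 1 - len(after) / len(before)

prompt = "Please  summarise this   report, and really just keep it basically short."
compressed = rule_based_compress(prompt)
```

Neural mode works differently: a small language model scores each token's information content and drops low-value tokens, which is where the extra latency (and the extra savings) comes from.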

Recommendation

Use rule-based for any user-facing request. Switch to LLMLingua for nightly batch summarisation jobs, document processing pipelines, or fine-tuning dataset preparation, where an extra few hundred milliseconds per call is acceptable.
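This recommendation can be sketched as a simple routing rule keyed on the caller's latency budget. The `Mode` enum, threshold, and function name below are hypothetical, not a ziptoken API; the threshold uses neural mode's worst-case latency from the table above.

```python
from enum import Enum

class Mode(Enum):
    RULE_BASED = "rule-based"   # <5 ms/call, 25-45% savings
    NEURAL = "neural"           # 100-400 ms/call, 55-70% savings

def pick_mode(latency_budget_ms: float) -> Mode:
    # Neural mode can take up to ~400 ms per call; only choose it when
    # the caller can absorb that worst case (batch or offline jobs).
    return Mode.NEURAL if latency_budget_ms >= 400 else Mode.RULE_BASED

pick_mode(50)      # user-facing request -> Mode.RULE_BASED
pick_mode(60_000)  # nightly batch job   -> Mode.NEURAL
```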
