Every token in your prompt costs money and uses context window space. Prompt compression — the practice of reducing prompt length without sacrificing the information the model needs — is a systematic cost optimisation that most teams apply ad hoc (if at all) but could apply methodically to capture meaningful savings across their highest-volume workflows.
Why Prompts Get Fat
Prompts grow through iteration. You start with a brief instruction. An edge case produces a bad output, so you add a clarifying sentence. Another edge case appears, so you add another. A colleague suggests adding more context. Before long, you have a 2,000-token system prompt that could achieve the same results in 600 tokens with deliberate compression. The accretion is natural; the compression requires deliberate effort.
Compression Technique 1: Remove Redundant Instructions
Read every instruction in your prompt and ask: does this change the output if I remove it? Many prompts contain instructions that the model would follow anyway without being told — “be accurate”, “be helpful”, “respond in English” (when the user’s message is in English). Instructions about default behaviours add no value and cost tokens. Test removal against 20 representative inputs before deleting permanently.
Compression Technique 2: Reduce Examples
Few-shot examples are valuable but token-expensive. A prompt with five examples may perform nearly as well with two carefully chosen, maximally representative examples. Identify the two examples that cover the most important pattern variations for your task. Use those and remove the rest. Measure quality impact on a test set before and after.
Compression Technique 3: Compress Prose Instructions to Structured Format
Prose instructions are less token-efficient than structured formats. Compare: “Your job is to analyse the customer feedback and determine whether the sentiment is positive, negative, or neutral. You should also identify the main topic that the feedback is about.” versus “Task: Classify feedback. Output: {sentiment: positive|neutral|negative, topic: string}.” The structured version conveys the same instruction in roughly half the tokens and typically produces more consistent output.
Prompt Compression: Techniques and Typical Savings
| Technique | Typical Token Reduction | Quality Risk |
|---|---|---|
| Remove redundant instructions | 10–30% | Very Low |
| Reduce few-shot examples | 20–50% | Low (test first) |
| Structured vs prose format | 20–40% | Low–Medium |
| Context window pruning | 30–70% | Low if done carefully |
Using AI to Compress Your Prompts
Meta-prompting — asking AI to compress your prompts — is an underused technique. Share your current prompt with Claude and ask: “Compress this prompt to reduce token count by 40% without losing the constraints and instructions that actually affect output quality. Remove anything redundant or that the model would infer without being told. Maintain the most important instructions in the most token-efficient format.” Review the compressed version, test it against your quality benchmark, and iterate if needed. This process consistently surfaces compression opportunities that manual review misses.
Compression Technique 4: Context Window Pruning
In multi-turn conversations or workflows where context accumulates across multiple steps, the context window can grow to include information that was relevant in earlier steps but is no longer needed. Context window pruning removes this stale content before each API call. In a customer service workflow that has exchanged ten messages, the first three exchanges about account verification may be irrelevant to the current question about a billing dispute — pruning them reduces token consumption and focuses the model’s attention on the current issue.
Implement pruning with a summarisation step: when context length exceeds a threshold, compress older turns into a brief summary and remove the originals from the active context. “Messages 1–5 summary: Customer verified account ownership, confirmed email address, and asked about subscription pricing.” This summary retains the important information in a fraction of the tokens. The pruning step costs tokens to run but saves significantly more than it costs when the conversations involved are long.
Measuring the Impact of Compression
Before and after any compression project, measure three things: average input tokens per request, average output tokens per request, and output quality score (from your evaluation rubric or a sample review). The first two tell you the cost impact; the third tells you whether quality was preserved. A compression that reduces token usage by 40% but reduces quality scores by 15% may not be a net positive when the human correction time required to fix the lower-quality outputs is included. Measure all three before committing compression changes to production.
Track compression metrics over time as well as at the point of implementation. Prompts that have been through a compression pass tend to drift back toward verbosity as new instructions are added for new edge cases. A quarterly compression review — looking at your highest-volume prompts for opportunities to re-compress — maintains the efficiency gains over time rather than allowing them to erode through prompt accumulation.
When Not to Compress
Not all prompts benefit from compression. Short prompts that are already efficient — a system prompt of 200 tokens for a simple classification task — are not worth compressing further. The effort of optimising an already-efficient prompt produces negligible savings. Prompts where every instruction is genuinely load-bearing — where removing any instruction produces a measurable quality drop — should not be compressed without adding different content. And prompts where quality is marginal and the task is high-stakes should be improved through better instructions rather than compressed — compression optimises for cost, not quality. Apply compression where there is genuine fat to cut and the task has sufficient volume to justify the optimisation effort.
Run your highest-volume prompt through the three compression techniques this week. Measure before and after. The token saving — and the cost reduction it produces — starts with the very first API call after the optimised prompt goes live.
Automating Compression With Meta-Prompting
Meta-prompting — using AI to optimise your prompts — applies cleanly to compression. Share your current prompt with Claude and ask: “Audit this prompt for token efficiency. Identify every instruction the model would follow without being explicitly told, every redundant phrase, and every section that could be expressed more concisely without losing meaning. Rewrite it at maximum compression while preserving all functional constraints.” The AI’s compression suggestions are a starting point — review them and restore any instruction you believe is genuinely load-bearing before testing.
This process works best as a structured exercise rather than an ad hoc request. Run every high-volume prompt through a compression audit before it goes to production, and again every three to six months as the prompt accumulates additions from edge case handling. Prompts that have not been through a compression pass in over six months are typically 30–50% longer than necessary — accumulated from incremental additions with no corresponding removals.
Caching as a Complement to Compression
Prompt compression reduces the cost of each API call by reducing input token count. Caching eliminates the cost of repeated calls entirely by returning stored results for identical or semantically similar inputs. The two techniques are complementary: compress prompts to reduce per-call cost, and implement caching to eliminate redundant calls altogether. Together, they typically reduce total API costs by 50–70% for production workflows without any change to output quality.
Semantic caching — available through Portkey and similar AI gateways — matches incoming requests to cached results based on meaning rather than exact character match. A question about “how do I reset my password?” matches a cached answer for “where can I change my password?” because they are semantically equivalent. For customer service workflows, knowledge base Q&A, and other applications where similar questions recur frequently, semantic caching captures savings that exact-match caching misses.
Measuring Compression Impact on Quality
Before deploying a compressed prompt to production, validate that compression did not inadvertently remove instructions that were doing real work. Run both the original and compressed prompts against your quality evaluation set — the same fifty representative inputs you used to validate the original prompt — and compare pass rates. A compression that produces equivalent pass rates has successfully removed redundancy without losing functional content. A compression that reduces pass rates has removed something that was genuinely affecting output quality, even if it appeared redundant. Identify which removed instruction caused the regression and restore it in the compressed version. This test is fast to run and is the only reliable way to confirm that your compression preserved all the functional content of the original prompt.
The businesses that build genuine AI capability over time are those that treat each deployment as a learning opportunity — measuring what works, understanding what does not, and applying those lessons to the next implementation. That iterative discipline, applied consistently across your AI portfolio, produces compounding improvements in quality, reliability, and business impact that no single optimal deployment decision can match.