Every AI pricing page references tokens, but many business owners pay API bills for months without fully understanding what they are paying for. Tokens are not words, characters, or sentences — they are a specific unit of text that AI models use to process language. Understanding tokens is the foundation of understanding and controlling your AI costs.
The Plain-English Explanation
A token is roughly four characters of text, or about three-quarters of a word in English. A common word like “business” is typically a single token, while a longer word like “entrepreneurship” is usually split into two or more. A space before a word is typically counted as part of the following token. Common short words like “the”, “is”, and “of” are each one token. Less common or longer words may be split into multiple tokens, and the exact split depends on the tokeniser each model uses.
This matters because AI models such as GPT-4o, Claude, and Gemini process text in tokens, not words, and they are priced per token. When you send a 500-word prompt and receive a 300-word response, you are paying for approximately 650 input tokens and 390 output tokens (at roughly 1.3 tokens per word in English). Note that output tokens cost more than input tokens with most providers, typically three to five times more.
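The word-to-token rule of thumb can be written as a short calculation. This is a minimal sketch that assumes the rough 1.3 tokens-per-word average for English quoted above; the function name `estimate_tokens` is illustrative, and real counts will vary by tokeniser.

```python
# Rough English average used in this article; real tokenisers will differ.
TOKENS_PER_WORD = 1.3


def estimate_tokens(word_count: int, tokens_per_word: float = TOKENS_PER_WORD) -> int:
    """Return a rough token estimate for English text of the given word count."""
    return round(word_count * tokens_per_word)


prompt_tokens = estimate_tokens(500)    # a 500-word prompt
response_tokens = estimate_tokens(300)  # a 300-word response
print(prompt_tokens, response_tokens)
```

An estimate like this is only suitable for early planning; for production budgeting, measure real prompts with the provider's own token counting tools.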
Why Non-English Text Costs More
English is the language AI models were primarily trained on, which means English text tends to tokenise efficiently — close to the 4 characters per token average. Other languages are less efficient. Chinese, Japanese, and Korean characters often require one token each, even though a single character may represent a whole word. Arabic script can be similarly token-heavy. If your business communicates in multiple languages, factor this into your cost estimates. The same content in Japanese or Arabic may cost two to three times more to process than in English.
Token Reference Guide
| Content Type | Approximate Token Count |
|---|---|
| 1 word (English) | ~1.3 tokens |
| 1 sentence (~15 words) | ~20 tokens |
| 1 page (~250 words) | ~330 tokens |
| 10-page document (~2,500 words) | ~3,300 tokens |
| A full novel (~80,000 words) | ~104,000 tokens |
Input Tokens vs Output Tokens
Every API interaction has two token costs: input (what you send to the model) and output (what the model sends back). Input includes everything in your request — the system prompt, the user message, any conversation history, any documents or context you include. Output is the model’s response.
Output tokens consistently cost more. For Claude Sonnet 4, input tokens are $3 per million and output tokens are $15 per million — five times more expensive. For GPT-4o, the ratio is similar. This means that long, verbose AI responses are disproportionately expensive. Constraining output length — either through explicit instructions in the prompt or the max_tokens parameter — is one of the most effective cost levers available.
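To see why output length is such a strong lever, it helps to price a single request twice: once with a long response and once with a capped one. The sketch below uses the Claude Sonnet 4 prices quoted above ($3 and $15 per million tokens); the 700-token input and the two response lengths are assumed figures for illustration, and `request_cost` is a hypothetical helper, not part of any provider SDK.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in dollars of one request, given prices per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Prices per million tokens, as quoted in this article for Claude Sonnet 4.
SONNET_INPUT_PRICE = 3.00
SONNET_OUTPUT_PRICE = 15.00

# Assumed example sizes: the same 700-token request, with two response lengths.
verbose = request_cost(700, 1000, SONNET_INPUT_PRICE, SONNET_OUTPUT_PRICE)
capped = request_cost(700, 300, SONNET_INPUT_PRICE, SONNET_OUTPUT_PRICE)

print(f"verbose: ${verbose:.4f}  capped: ${capped:.4f}")
print(f"saving per request: {1 - capped / verbose:.0%}")
```

Keep in mind that a hard token cap generally cuts a response off rather than making it more concise, so it tends to work best alongside an explicit length instruction in the prompt.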
How to Estimate Costs Before You Build
Before building any AI-powered feature, estimate the token cost per interaction and multiply by expected volume. A customer service chatbot handling 1,000 queries per day, each with a 500-token system prompt, 200-token user message, and 300-token response, uses approximately 700,000 input tokens and 300,000 output tokens per day. At Claude 3 Haiku list pricing ($0.25 per million input tokens, $1.25 per million output tokens), that is approximately $0.55 per day, or about $200 per year. The same workflow on Claude Sonnet 4 ($3 and $15 per million) would cost approximately $6.60 per day, or about $2,400 per year. The model choice matters significantly at scale.
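The chatbot arithmetic can be reproduced from the per-query figures with a few lines of code. This is a sketch under stated assumptions: the Sonnet prices are the ones quoted in this article, the Haiku prices are the Claude 3 Haiku list prices at one point in time, and both should be checked against the provider's current pricing page. The names `PRICES` and `daily_cost` are illustrative.

```python
# Per-query sizes from the chatbot example.
QUERIES_PER_DAY = 1_000
INPUT_TOKENS_PER_QUERY = 500 + 200   # system prompt + user message
OUTPUT_TOKENS_PER_QUERY = 300

# Assumed prices in dollars per million tokens: (input, output).
PRICES = {
    "claude-3-haiku": (0.25, 1.25),
    "claude-sonnet-4": (3.00, 15.00),
}


def daily_cost(model: str) -> float:
    """Daily cost in dollars for the example chatbot on the given model."""
    input_price, output_price = PRICES[model]
    input_tokens = QUERIES_PER_DAY * INPUT_TOKENS_PER_QUERY
    output_tokens = QUERIES_PER_DAY * OUTPUT_TOKENS_PER_QUERY
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


for model in PRICES:
    per_day = daily_cost(model)
    print(f"{model}: ${per_day:.2f}/day, ${per_day * 365:,.0f}/year")
```

Swapping in your own volumes and prices takes seconds, which makes this a convenient way to compare models before committing to one.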
Use OpenAI’s tokeniser tool (available at platform.openai.com/tokenizer) or Anthropic’s token counting API to measure your actual prompts before deploying. Estimates based on word count are close enough for initial planning; exact counts matter for production budgeting.
How to Test Models Before Committing
The right way to decide which model to use is empirically, not by assumption. Take 50 representative examples of the task, either real inputs from your application or realistic synthetic ones, and run them through each candidate model. Define your quality criteria before you look at the outputs: accuracy, completeness, format adherence, tone. Score each output against those criteria. If the quality gap is within your acceptable range, use the cheaper model. If it is not, use the more expensive one for that task type specifically.
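The decision rule at the end of this test can be made explicit. The sketch below assumes you have already scored each output on a numeric scale (here 1 to 5) and agreed on an acceptable gap in advance; the function name `choose_model`, the scores, and the threshold are all hypothetical.

```python
from statistics import mean


def choose_model(cheap_scores: list, premium_scores: list, max_gap: float) -> str:
    """Pick the cheaper model unless the premium model's average score
    exceeds it by more than the gap agreed before scoring."""
    gap = mean(premium_scores) - mean(cheap_scores)
    return "cheap" if gap <= max_gap else "premium"


# Hypothetical 1-5 quality scores for the same sample inputs on each model.
cheap = [4, 4, 5, 3]
premium = [5, 4, 5, 4]

print(choose_model(cheap, premium, max_gap=0.6))
```

Fixing `max_gap` before looking at any outputs is the important design choice here, because it stops the threshold drifting to justify whichever model you already preferred.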
This test takes two to three hours for most task types and gives you durable, data-driven model selection decisions rather than intuitions that change every time someone reads a new benchmark article. Repeat the test when a new model version is released — model capabilities change, and a task that required GPT-4o six months ago may be well within GPT-4o Mini’s capability today.
Hybrid Model Strategies
The most cost-efficient AI applications do not use a single model for everything — they route different task types to the appropriate model. A customer service application might use Claude Haiku for intent classification and ticket routing, GPT-4o Mini for generating standard response templates, and Claude Sonnet only for the subset of queries requiring nuanced analysis or sensitive handling. This tiered approach captures the quality benefits of premium models where they matter while eliminating their cost where they do not.
Implementing routing logic requires knowing which task type a given request falls into — which is itself a classification task well-suited to a cheap model. A two-stage architecture where a fast, cheap model first classifies the request and routes it to the appropriate tier adds negligible latency and cost while enabling significant savings at the workflow level.
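A two-stage router of this kind might look roughly like the following. This is a sketch only: `classify_request` is a keyword-based stand-in for what would really be a call to a cheap classifier model, and the labels and model names (`mid-tier-model`, `premium-model`) are placeholders rather than real identifiers.

```python
# Placeholder mapping from task label to model tier.
LABEL_TO_MODEL = {
    "standard": "mid-tier-model",
    "sensitive": "premium-model",
}

# Illustrative keywords only; a real system would use a classifier model here.
SENSITIVE_KEYWORDS = ("refund", "complaint", "legal", "cancel")


def classify_request(text: str) -> str:
    """Stage one: stand-in for a cheap model that labels the request."""
    lowered = text.lower()
    if any(keyword in lowered for keyword in SENSITIVE_KEYWORDS):
        return "sensitive"
    return "standard"


def route(text: str) -> str:
    """Stage two: map the label to a model, defaulting to the premium tier
    for any label the routing table does not recognise."""
    label = classify_request(text)
    return LABEL_TO_MODEL.get(label, "premium-model")


print(route("What are your opening hours?"))
print(route("I want a refund and I am considering legal action."))
```

Defaulting unknown labels to the premium tier is a deliberately conservative choice: a misrouted request then costs a little more instead of producing a weaker answer.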
Monitoring Quality After Cost Optimisation
Every cost optimisation should be followed by a quality monitoring period. Track output quality metrics — user satisfaction scores, error rates, escalation rates, manual correction frequency — after switching to a cheaper model or trimming a prompt. Most optimisations that are correctly scoped and tested produce no measurable quality change. Occasionally, a change that tested well on 50 samples shows a problem at scale with edge cases you did not anticipate. Catching this early, before it affects thousands of users, requires active monitoring in the first two to four weeks after any significant change.
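One simple way to run this monitoring is to compare each metric against its pre-change baseline and flag anything that has worsened beyond a tolerance. The sketch below assumes every metric is a rate where higher is worse (error rate, escalation rate, and so on); the metric names, the figures, and the 10% tolerance are illustrative.

```python
def find_regressions(baseline: dict, current: dict, tolerance: float = 0.10) -> list:
    """Return names of metrics that worsened by more than `tolerance`
    (relative to baseline). Assumes higher values are worse."""
    flagged = []
    for name, base in baseline.items():
        if name not in current:
            continue  # metric not collected after the change
        now = current[name]
        if base == 0:
            worsened = now > 0
        else:
            worsened = (now - base) / base > tolerance
        if worsened:
            flagged.append(name)
    return flagged


# Hypothetical rates before and after switching to a cheaper model.
before = {"error_rate": 0.020, "escalation_rate": 0.050}
after = {"error_rate": 0.021, "escalation_rate": 0.065}

print(find_regressions(before, after))
```

Metrics where higher is better, such as satisfaction scores, would need the comparison reversed or the metric inverted before being passed in.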
Tokens Across Different Content Types
Not all content tokenises equally. Code is generally token-efficient because programming languages use short keywords and consistent patterns. Markdown with extensive formatting characters (hashes, asterisks, brackets) uses more tokens than plain text of equivalent information density. Numbers tokenise variably — a short number like “42” may be one token, while a long decimal like “3.14159265” may be four or five. If your application processes large volumes of numerical data, consider whether the model actually needs full precision or whether rounded values would serve the task equally well at lower cost.
Whitespace counts. Extra blank lines, excessive indentation, and unnecessary line breaks all contribute to token count. A prompt that uses clean, compact formatting typically costs 5–15% less than the same prompt with generous whitespace, with no meaningful difference in output quality. This is a small saving per request but compounds at volume.
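Compacting a prompt can be automated with a small normalisation step. This is a minimal sketch that trims each line, collapses runs of spaces and tabs, and reduces runs of blank lines to a single blank line; the function name `compact_prompt` is illustrative. It is not suitable for prompts containing code or other indentation-sensitive content, since it removes leading indentation.

```python
import re


def compact_prompt(text: str) -> str:
    """Trim lines, collapse repeated spaces/tabs, and squeeze blank-line runs."""
    lines = [re.sub(r"[ \t]+", " ", line.strip()) for line in text.strip().splitlines()]
    joined = "\n".join(lines)
    # Three or more newlines in a row means two or more blank lines; keep one.
    return re.sub(r"\n{3,}", "\n\n", joined)


padded = "  You are a helper.   \n\n\n\n  Answer   briefly.\n\n"
print(repr(compact_prompt(padded)))
```

Measuring token counts before and after with your provider's tokeniser will show whether the saving is worth applying for your particular prompts.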
Estimating Costs Before You Build
Every new AI feature should have a cost estimate before development begins. The estimate does not need to be precise — within 50% is sufficient for planning purposes. Start with the average prompt size (system prompt + user input + any context), estimate the average response length, multiply by the model’s per-token price, and multiply by expected daily request volume. This gives you a daily cost estimate that you can project to monthly and annual figures.
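Because the estimate only needs to be within about 50%, it can be useful to project a range rather than a single figure. The sketch below takes a daily cost estimate and returns low, expected, and high values for monthly and annual spend; the function name `project_costs`, the 30-day month, and the example daily figure are assumptions for illustration.

```python
def project_costs(daily_cost: float, band: float = 0.5) -> dict:
    """Project a daily cost to monthly and annual figures, each as a
    (low, expected, high) tuple using a +/- `band` planning range."""
    projection = {}
    for label, days in (("monthly", 30), ("annual", 365)):
        expected = daily_cost * days
        projection[label] = (expected * (1 - band), expected, expected * (1 + band))
    return projection


# Hypothetical daily estimate of $2.00.
for period, (low, expected, high) in project_costs(2.0).items():
    print(f"{period}: ${low:,.0f} to ${high:,.0f} (expected ${expected:,.0f})")
```

If even the high end of the range is trivial relative to the value of the feature, that is usually a sign that further optimisation can wait.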
Compare the estimated cost against the business value the feature delivers. If the cost is trivially small relative to the value, proceed without extensive optimisation. If the cost is material, design the feature with efficiency in mind from the start — it is significantly easier to build a cost-efficient feature than to retrofit efficiency into an existing one. The ten minutes spent on a cost estimate before development begins can save days of optimisation work after launch.
Tokens and Data Privacy
Understanding tokens also matters for data privacy. When you send data to an AI API, every token in your request is processed by the provider’s infrastructure. Understanding what you are sending — and therefore what data is leaving your systems — is important for compliance with data privacy obligations. Review your prompts for any data that should not be sent to third-party AI providers: personally identifiable information, confidential business data, data subject to regulatory restrictions. The token-level understanding of what you send gives you the precision to redact or anonymise only what is necessary rather than avoiding AI use altogether for data-sensitive workflows.
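Targeted redaction before a prompt leaves your systems can be sketched as follows. The two regular expressions below are illustrative patterns for email addresses and phone-like numbers only; they are assumptions for this example, not an exhaustive or compliance-grade detector, and the function name `redact` is hypothetical.

```python
import re

# Illustrative patterns only; real PII detection needs broader coverage.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_PATTERN = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    return PHONE_PATTERN.sub("[PHONE]", text)


message = "Contact jane.doe@example.com or +44 20 7946 0958 about order 12345."
print(redact(message))
```

Short numbers such as the order reference are left intact here, which keeps the prompt useful to the model while the directly identifying details are removed.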
Applying This in Your Business This Week
Knowledge without application produces no results. The frameworks, tools, and techniques in this article are only valuable when they are applied to real workflows in your specific business context. Pick the single most expensive or highest-volume AI workflow you currently run. Measure its current cost per call. Apply the most relevant optimisation from this article — whether that is model selection, prompt trimming, caching, output limits, or monitoring. Measure again. Share the result with your team.
That single application will teach you more than reading ten more articles about AI cost optimisation. It will surface the specific constraints of your stack, the trade-offs relevant to your use case, and the levers that actually move the needle for your application. Every subsequent optimisation builds on that foundation of practical experience.