Model routing is one of the highest-leverage optimisations available to any business running AI at meaningful volume. Instead of sending every request to the same model, typically the most capable (and most expensive) one, a routing layer directs each request to the cheapest model that can handle it reliably. A well-implemented routing layer can cut total AI API costs by 40–70% with little or no measurable loss in output quality, provided each task type is matched to a model tier that has been validated for it.
The Core Principle
Different tasks have fundamentally different requirements. A support ticket classification task (“is this a billing question, a technical issue, or a general enquiry?”) requires accurate pattern matching, not deep reasoning. GPT-4o Mini or Claude Haiku handles this reliably at roughly one-tenth the cost of GPT-4o or Claude Sonnet, or less. A complex contract analysis task requires sophisticated reasoning and nuanced judgment, and that is where the premium model earns its price. Routing sends each task type to the appropriate tier rather than applying a single model uniformly.
Building a Simple Router
The simplest router is a rule-based classifier: you define categories of tasks and assign each to a model. Classification and extraction tasks → GPT-4o Mini. Short summarisation → Claude Haiku. Templated generation → GPT-4o Mini. Complex analysis → GPT-4o or Claude Sonnet. The router checks incoming requests against these rules and directs them accordingly. This requires no AI — just a lookup table or switch statement — and is the right starting point before building more sophisticated approaches.
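The lookup-table approach described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the task-type keys and the model name strings are placeholder labels chosen for this example, and the choice to send unknown task types to the premium model is an assumption.

```python
# Hypothetical task categories mapped to placeholder model labels.
# Replace both with the categories and model identifiers your stack uses.
ROUTES = {
    "classification": "gpt-4o-mini",
    "extraction": "gpt-4o-mini",
    "short_summary": "claude-haiku",
    "templated_generation": "gpt-4o-mini",
    "complex_analysis": "gpt-4o",
}

# Assumption: anything the table does not recognise goes to the premium
# model, which is the safer default for unfamiliar work.
DEFAULT_MODEL = "gpt-4o"


def route(task_type: str) -> str:
    """Return the model assigned to a task type, or the premium default."""
    return ROUTES.get(task_type, DEFAULT_MODEL)
```

Calling `route("classification")` returns the cheap tier, while an unrecognised task type such as `route("legal_review")` falls through to the premium default.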
A more sophisticated router uses a cheap, fast model to classify each incoming request and then routes based on the classification. The classifier itself runs on a minimal model, typically consuming a few hundred tokens per request, so its cost is very small relative to the request it routes. It does add one extra short model call of latency before the main request. This approach handles requests that do not fit clean rule categories by inferring the task type from the content.
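A classifier-based router might look like the sketch below. The `classify` argument is a hypothetical stand-in for a call to a cheap model that returns a task label; it is injected as a plain function so the routing logic can be tested without any provider SDK. The labels and model names are illustrative assumptions.

```python
from typing import Callable

# Placeholder tier names; substitute real model identifiers.
CHEAP_MODEL = "cheap-model"
PREMIUM_MODEL = "premium-model"

# Hypothetical labels that have been validated as safe for the cheap tier.
CHEAP_TASKS = {"classification", "extraction", "short_summary"}


def route_with_classifier(request_text: str, classify: Callable[[str], str]) -> str:
    """Ask the classifier for a task label, then pick a tier from that label.

    `classify` is assumed to wrap a cheap model call that returns one label.
    Any label outside CHEAP_TASKS, including malformed classifier output,
    is routed to the premium model.
    """
    label = classify(request_text).strip().lower()
    return CHEAP_MODEL if label in CHEAP_TASKS else PREMIUM_MODEL
```

Normalising the label and defaulting to the premium tier means an unexpected classifier response degrades cost rather than quality.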
Model Routing: Task-to-Tier Mapping
| Task Category | Recommended Model | Typical Cost (% of premium model cost) |
|---|---|---|
| Classification / routing | GPT-4o Mini / Haiku | 5–10% |
| Structured extraction | GPT-4o Mini / Haiku | 5–10% |
| Short summarisation | GPT-4o Mini / Haiku | 5–15% |
| Standard content generation | GPT-4o Mini / Haiku | 10–20% |
| Complex analysis / reasoning | GPT-4o / Sonnet | 100% (premium baseline) |
Tools That Support Model Routing
Portkey and LiteLLM both support model routing natively. Portkey’s routing configuration allows conditional routing based on request metadata, request content characteristics, or explicit routing keys in the request. LiteLLM provides a unified interface to multiple models and allows routing rules based on request tags or model aliases. For teams already using Portkey as their AI gateway, model routing is a configuration change; for teams not yet using a gateway, LiteLLM is the lightest-weight path to multi-model routing.
Testing Before Deploying
Before deploying a routing configuration to production, validate that the cheaper model meets your quality threshold for each task category it is assigned to. Run at least 50 representative examples of each task type through both the premium and the cheaper model, and score the outputs against defined quality criteria. Only route to the cheaper model for categories where it demonstrably meets your threshold. This empirical validation prevents the false economy of routing to cheaper models that produce outputs requiring expensive human correction.
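One way to structure that validation is a small harness that computes a pass rate per model and compares it with a threshold. In this sketch, `run_model` and `score` are hypothetical callables you would supply (a model call and a pass/fail quality check), and the threshold value is yours to choose.

```python
from typing import Callable, Sequence


def pass_rate(
    examples: Sequence[tuple[str, str]],
    run_model: Callable[[str], str],
    score: Callable[[str, str], bool],
) -> float:
    """Fraction of (input, expected) examples whose output passes `score`."""
    if not examples:
        raise ValueError("need at least one example")
    passed = sum(1 for text, expected in examples if score(run_model(text), expected))
    return passed / len(examples)


def safe_to_route(cheap_rate: float, threshold: float) -> bool:
    """Route a category to the cheap model only if it meets the threshold."""
    return cheap_rate >= threshold
```

Running `pass_rate` once with the premium model and once with the cheaper model on the same examples gives the two numbers the routing decision depends on.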
Monitoring After Deployment
Track quality metrics for each routed category after deployment — not just cost savings. A 50% cost reduction that comes with a 10% increase in output correction rate may not be a net positive when human time costs are included. Monitor cost per useful output (not just cost per API call) and adjust routing rules when quality or efficiency signals indicate a category is misassigned. Good routing configuration is maintained and improved over time, not set once and forgotten.
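The idea of cost per useful output can be made concrete with a short calculation. The formula below is one reasonable definition and an assumption of this sketch: it adds the expected human correction cost to the API cost of each call. All figures are illustrative, and the example reads the "10% increase" as ten percentage points.

```python
def cost_per_useful_output(
    api_cost_per_call: float,
    correction_rate: float,
    cost_per_correction: float,
) -> float:
    """API cost plus the expected human correction cost for one output."""
    return api_cost_per_call + correction_rate * cost_per_correction


# Illustrative numbers only: a premium call at $0.010 with 5% of outputs
# corrected, versus a cheaper call at $0.005 with 15% corrected, where each
# correction costs $0.50 of human time.
premium = cost_per_useful_output(0.010, 0.05, 0.50)
cheap = cost_per_useful_output(0.005, 0.15, 0.50)
cheaper_model_wins = cheap < premium
```

With these particular inputs the cheaper model halves the API cost but ends up more expensive per useful output, so `cheaper_model_wins` is `False`. Different correction costs can reverse the result, which is why the metric needs your own numbers.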
Validating Quality Before Routing to Cheaper Models
The most common mistake in model routing is routing a task type to a cheaper model based on cost alone, without empirically validating that the cheaper model meets the quality threshold for that specific task. A classification task that GPT-4o Mini handles correctly 97% of the time but that GPT-4o handles correctly 99.5% of the time may seem like a good routing candidate, until you calculate the cost of the 2.5-percentage-point accuracy gap at production volume. For a workflow processing 10,000 classifications per day, that gap produces 250 additional errors daily (10,000 × 0.025). Whether that trade-off is acceptable depends entirely on the cost of those errors, which is specific to your use case.
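The arithmetic in this example is easy to reproduce, and extending it gives a break-even figure. In the sketch below, the accuracy and volume numbers come from the paragraph above, while the daily saving of $40 is a purely hypothetical value used to show the calculation.

```python
def additional_errors_per_day(
    daily_volume: int, cheap_accuracy: float, premium_accuracy: float
) -> float:
    """Extra errors per day caused by the accuracy gap between the models."""
    return daily_volume * (premium_accuracy - cheap_accuracy)


def break_even_cost_per_error(daily_savings: float, extra_errors: float) -> float:
    """Cost per error at which the routing saving is exactly cancelled out."""
    return daily_savings / extra_errors


# 10,000 requests per day at 97% versus 99.5% accuracy.
extra = additional_errors_per_day(10_000, 0.97, 0.995)

# Hypothetical: routing saves $40 per day in API spend.
break_even = break_even_cost_per_error(40.0, extra)
```

Here `extra` comes to about 250 errors per day, and `break_even` to about $0.16 per error. If a single error costs more than that to detect and fix, routing this task to the cheaper model loses money under these assumptions.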
Build a test set of 100–200 representative inputs, run them through both the premium and cheaper model, and measure the quality difference before deploying any routing configuration. If the quality gap is negligible, route to the cheaper model. If the gap is significant, understand the specific types of inputs where the cheaper model fails — you may be able to route confidently to the cheaper model for 80% of inputs and only fall back to the premium model for the 20% that exhibit characteristics the cheaper model struggles with.
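To find where the cheaper model fails, it helps to break test results down by an input characteristic. The sketch below assumes you already have pass/fail results for the cheaper model and a hypothetical `bucket_of` function that assigns each input to a bucket (for example by length or language).

```python
from collections import defaultdict
from typing import Callable, Iterable


def failure_rate_by_bucket(
    results: Iterable[tuple[str, bool]],
    bucket_of: Callable[[str], str],
) -> dict[str, float]:
    """Group (input, passed) results by bucket and return each failure rate."""
    totals: dict[str, int] = defaultdict(int)
    failures: dict[str, int] = defaultdict(int)
    for text, passed in results:
        bucket = bucket_of(text)
        totals[bucket] += 1
        if not passed:
            failures[bucket] += 1
    return {bucket: failures[bucket] / totals[bucket] for bucket in totals}
```

Buckets with a failure rate near zero are candidates for the cheaper model; buckets where failures concentrate are candidates for the premium fallback.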
Routing Based on Input Characteristics
The most sophisticated routing configurations route not just by task type but by characteristics of specific inputs. Short inputs with clear structure go to the cheaper model; long inputs with complex requirements go to the premium model. Inputs in languages the cheaper model handles well (usually the widely used ones that dominate its training data) go to the cheaper model; inputs in less common languages go to the premium model. Inputs matching known simple patterns go to the cheaper model; inputs that do not match known patterns go to the premium model for safer handling.
Implementing input-characteristic routing requires a classifier step that analyses each input before routing it. This classifier should itself use a cheap, fast model — a few hundred tokens to analyse the input and return a routing decision costs almost nothing and adds minimal latency. The additional infrastructure complexity is justified when your input distribution has enough variation that a single blanket routing decision leaves significant cost savings or quality improvements on the table.
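As a starting point before wiring in a model-based classifier, the length and pattern rules above can be approximated with plain string checks. Note that this sketch swaps the cheap-model classifier for simple heuristics; the character limit and the regular expressions are illustrative assumptions, and language detection is left out.

```python
import re

# Hypothetical patterns for requests known to be simple.
KNOWN_SIMPLE_PATTERNS = [
    re.compile(r"\b(refund|invoice|billing)\b", re.IGNORECASE),
    re.compile(r"\b(reset|forgot).{0,20}password\b", re.IGNORECASE),
]

# Assumed cut-off: longer inputs are treated as complex.
MAX_CHEAP_CHARS = 500


def route_by_input(text: str) -> str:
    """Return "cheap" or "premium" based on length and known patterns."""
    if len(text) > MAX_CHEAP_CHARS:
        return "premium"
    if any(pattern.search(text) for pattern in KNOWN_SIMPLE_PATTERNS):
        return "cheap"
    # Inputs that match no known pattern go to the premium tier.
    return "premium"
```

Heuristics like these cost nothing to run and can later be replaced by, or combined with, a cheap-model classification step for inputs they cannot decide.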
Monitoring Routing Effectiveness
A routing configuration that is not monitored will drift out of calibration as your input distribution changes and as model capabilities evolve. Monitor two metrics for each routing destination: cost per routed request (to verify the savings are being captured) and quality score for routed requests (to verify quality has not degraded). When either metric moves significantly from baseline, investigate whether the routing rules need adjustment. New input types may appear that do not fit your existing routing categories cleanly. Model updates from providers may change quality characteristics for specific task types. Quarterly routing audits — reviewing cost and quality metrics by routing destination — catch these shifts before they become significant problems.
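A baseline comparison for those two metrics can be as simple as the check below. The `RouteMetrics` structure and the 10% relative tolerance are assumptions for this sketch; what counts as a significant move depends on your own baselines.

```python
from dataclasses import dataclass


@dataclass
class RouteMetrics:
    cost_per_request: float  # average API cost for this routing destination
    quality_score: float     # average quality score, on whatever scale you use


def needs_review(
    baseline: RouteMetrics, current: RouteMetrics, tolerance: float = 0.10
) -> bool:
    """Flag a destination when cost or quality drifts beyond the tolerance.

    Drift is measured relative to the baseline value, in either direction.
    """
    cost_drift = abs(current.cost_per_request - baseline.cost_per_request) / baseline.cost_per_request
    quality_drift = abs(current.quality_score - baseline.quality_score) / baseline.quality_score
    return cost_drift > tolerance or quality_drift > tolerance
```

Running this check per routing destination on a schedule turns the quarterly audit into a review of flagged destinations rather than a manual scan of every metric.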
Implement your first routing rule this week: identify your highest-volume task type and test whether a cheaper model meets your quality threshold. If it does, the cost saving from routing that task type is immediate and ongoing.
Cross-Provider Routing for Reliability
Model routing is not only about cost — it is also a reliability strategy. If you route all requests to a single provider and that provider experiences an outage or rate limit surge, your entire AI capability is affected simultaneously. Cross-provider routing — where different task types route to different providers, or where a secondary provider serves as a fallback for the primary — provides resilience against provider-specific issues. When OpenAI is experiencing elevated latency, traffic routed to Anthropic as a fallback continues serving users with minimal disruption. When Anthropic releases a model update that changes output characteristics, the subset of tasks routed to OpenAI provides a baseline for comparison.
Build cross-provider routing from the start rather than after a provider outage reveals your single-provider dependency. The configuration overhead is usually small (one additional routing rule and one additional API key), although the fallback model should be validated against the same quality threshold as the primary. The reliability benefit is immediate. For business-critical AI applications, multi-provider routing is the same type of resilience investment as running on multiple cloud availability zones: inexpensive to implement proactively, expensive to retrofit after an outage causes a business impact.
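The fallback pattern itself is small. In the sketch below, each provider is represented by a hypothetical callable that takes a prompt and returns text; in practice these would wrap your provider SDK calls. Catching every exception is a simplification for illustration, and a real deployment would likely retry only on timeouts, rate limits, and server errors.

```python
from typing import Callable, Optional, Sequence


def call_with_fallback(prompt: str, providers: Sequence[Callable[[str], str]]) -> str:
    """Try each provider in order and return the first successful response."""
    last_error: Optional[Exception] = None
    for call_provider in providers:
        try:
            return call_provider(prompt)
        except Exception as exc:  # simplification: treat any failure as retryable
            last_error = exc
    # Raised when the list is empty or every provider failed.
    raise RuntimeError("all providers failed") from last_error
```

The order of the `providers` list encodes the priority: the primary provider comes first and each later entry is only used when the ones before it fail.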
Documenting Your Routing Configuration
A routing configuration that is not documented is a single point of knowledge that creates operational risk when the person who configured it is unavailable. Document your routing rules clearly: which task types route to which models, the reasoning behind each assignment, the quality test results that validated the routing decision, and the monitoring thresholds that would trigger a routing review. This documentation does not need to be elaborate — a Notion page or a YAML comment block in your LiteLLM configuration with one paragraph per routing rule is sufficient. When a new engineer joins your team, when a provider updates their models and you need to review your routing, or when a quality issue prompts an investigation, the documentation makes the review fast. Without it, every investigation starts from scratch.