When a model's job is to decide something, the decision is a few characters. The sentence wrapped around it is overhead you pay for on every single call, and at volume that overhead dominates the bill.
Before -- a support-ticket router that answers in prose:
Prompt: Which team should handle this ticket?
Response: Based on the description, this ticket is about a
failed credit-card charge, so it should be routed to the
Billing team for further investigation.
That is roughly 30 output tokens (the exact count depends on the tokenizer), and your code still has to string-match "Billing" out of it.
After -- constrain the output to an enum:
Prompt: Route this ticket. Reply with exactly one of:
BILLING | TECH | ACCOUNT | OTHER. Output only the code.
Response: BILLING
That is one or two tokens, depending on the tokenizer. Same decision, a fraction of the cost, and nothing to parse: an exact comparison against the four codes is enough.
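On the receiving side, handling the reply can be as small as an enum lookup. This is a minimal sketch, not a prescribed implementation: the names `Team` and `parse_team` are illustrative, and falling back to OTHER on an unexpected reply is an assumed policy (raising an error or retrying are equally reasonable).

```python
from enum import Enum


class Team(Enum):
    """The closed set of routing codes from the prompt."""
    BILLING = "BILLING"
    TECH = "TECH"
    ACCOUNT = "ACCOUNT"
    OTHER = "OTHER"


def parse_team(raw: str) -> Team:
    """Map the model's reply to a Team.

    Whitespace and case are normalized first. Anything that is not
    one of the allowed codes falls back to OTHER (assumed policy).
    """
    code = raw.strip().upper()
    try:
        return Team(code)  # lookup by value
    except ValueError:
        return Team.OTHER


print(parse_team("BILLING\n"))                # Team.BILLING
print(parse_team("I think this is billing"))  # Team.OTHER
```

The fallback matters because the model can still occasionally ignore the instruction, and a closed enum turns that into one explicit branch instead of a fragile substring search.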
The same pattern applies anywhere the model picks from a known set: returning a row id instead of restating the record, a category code instead of a description, a true/false instead of "Yes, this appears to be...". If you have product records, ask for the SKU, not the product name and blurb.
Why it works: most providers price output tokens higher than input tokens, often several times higher, and output is generated one token at a time, so reply length also drives latency. Collapsing a decision into a single enum or ID attacks both cost and latency. Giving the model a closed list also reduces the chance it invents a category, so accuracy usually holds or improves.
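To put rough numbers on the cost side, here is a back-of-the-envelope sketch. Every input is an assumption chosen for easy arithmetic: the call volume, the token counts, and especially the price, which is a placeholder and not any provider's actual rate.

```python
def output_cost(calls: int, tokens_per_reply: int,
                usd_per_million_output_tokens: float) -> float:
    """Total spend on output tokens for a batch of calls."""
    return calls * tokens_per_reply * usd_per_million_output_tokens / 1_000_000


# Assumed, illustrative inputs; substitute your own volume and rate.
CALLS = 1_000_000
PRICE = 10.0  # hypothetical USD per million output tokens

prose_cost = output_cost(CALLS, 30, PRICE)  # prose reply, about 30 tokens
enum_cost = output_cost(CALLS, 2, PRICE)    # enum reply, assumed 2 tokens

print(prose_cost, enum_cost)  # 300.0 20.0
```

Under these assumptions the output bill drops by a factor of 15. The input side of the bill is unchanged, so the total saving depends on how long your prompts are.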
A few practices that keep it reliable:
- Spell out the allowed values explicitly and say "output only the code" so no preamble sneaks back in.
- Use values that are single tokens where you can (short uppercase words beat long hyphenated phrases), and check them against your model's tokenizer, since token boundaries vary between models.
- If you genuinely need a reason, request it as a separate short field, or in a follow-up call made only for the cases that need it, rather than baking prose into every response.
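The practices above can be combined in a short sketch. Everything named here is hypothetical: `complete` stands in for whatever function sends a prompt to your model and returns its text, and `build_prompt` and `route` are illustrative helpers, not part of any library. The reason is fetched with a second call only when the caller asks for it.

```python
from typing import Callable, Optional, Tuple

ALLOWED = ("BILLING", "TECH", "ACCOUNT", "OTHER")


def build_prompt(ticket: str) -> str:
    """Spell out the allowed values and forbid any preamble."""
    options = " | ".join(ALLOWED)
    return (
        "Route this ticket. Reply with exactly one of:\n"
        f"{options}. Output only the code.\n\n"
        f"Ticket: {ticket}"
    )


def route(ticket: str,
          complete: Callable[[str], str],
          want_reason: bool = False) -> Tuple[str, Optional[str]]:
    """Return (code, reason); reason is None unless requested.

    `complete` is a caller-supplied stand-in for the model call.
    """
    code = complete(build_prompt(ticket)).strip().upper()
    if code not in ALLOWED:
        code = "OTHER"  # assumed fallback for off-list replies
    reason = None
    if want_reason:
        # Follow-up call: prose is paid for only when someone needs it.
        follow_up = (
            f"In one sentence, why does this ticket belong to {code}?\n\n"
            f"Ticket: {ticket}"
        )
        reason = complete(follow_up).strip()
    return code, reason
```

Passing the model call in as a parameter keeps the sketch independent of any particular client library, and makes the routing logic easy to exercise with a stub in tests.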