On almost every major API, output tokens cost several times more than input tokens. As of early 2026, Claude Sonnet bills output at roughly 5x its input rate, and GPT-class models commonly sit in the 3-4x range. That means a model that rambles is burning your most expensive token type.
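To make the price gap concrete, here is a minimal arithmetic sketch. The prices are assumptions chosen to match the roughly 5x ratio mentioned above (3 and 15 USD per million tokens); `call_cost` is a hypothetical helper, and your provider's current pricing page is the real source of truth.

```python
# Illustrative prices in USD per million tokens (assumed; check your provider's pricing).
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00  # 5x the input rate


def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD of one call at the assumed prices."""
    return (
        input_tokens * INPUT_PRICE_PER_MTOK
        + output_tokens * OUTPUT_PRICE_PER_MTOK
    ) / 1_000_000


# Same 200-token prompt, two different answer lengths.
terse = call_cost(input_tokens=200, output_tokens=2)
rambling = call_cost(input_tokens=200, output_tokens=150)

print(f"terse:    ${terse:.6f}")
print(f"rambling: ${rambling:.6f}")
print(f"ratio:    {rambling / terse:.1f}x")
```

With these assumed numbers the rambling answer costs about 4.5 times as much as the terse one, even though the prompt is identical.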
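max_tokens is your hard ceiling: the model cannot emit more than that many output tokens in one response. Where the API lets you omit it, the limit typically falls back to the model's maximum output length; where it is required, as in Anthropic's Messages API, people tend to paste in a large value and forget about it. Either way, a single confused call can generate thousands of tokens you never wanted.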
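Before (oversized cap, open-ended cost):
import anthropic
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
resp = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=4096,  # required by this API; a large value pasted in and forgotten
    messages=[{"role": "user", "content": "Classify this ticket as bug/feature/question."}],
)
# Model may return a paragraph explaining its reasoning -> 150+ output tokens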
After (capped to the job):
resp = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=5,  # one word fits easily
    messages=[{"role": "user", "content": "Reply with exactly one word: bug, feature, or question."}],
)
Why it saves tokens: You only pay for output tokens actually generated, so max_tokens is a ceiling rather than a fixed charge. Its real value is protecting against the worst case: a malformed prompt, an injection, or a model that decides to "explain" can otherwise run until it hits the model's maximum output length. Sizing the cap to the task (a few tokens for a label, a few hundred for a summary) bounds the bill on every single call.
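The "bounds the bill" point can be sketched as a worst-case calculation. Everything here is an assumption for illustration: the `TASK_CAPS` values are placeholders to tune, the output price is the same assumed 15 USD per million tokens, and `worst_case_output_spend` is a hypothetical helper.

```python
# Hypothetical per-task caps; placeholders to tune, not recommendations.
TASK_CAPS = {
    "label": 5,
    "summary": 300,
}

OUTPUT_PRICE_PER_MTOK = 15.00  # assumed USD price per million output tokens


def worst_case_output_spend(task: str, calls: int) -> float:
    """Upper bound in USD on output spend if every call hits its cap."""
    cap = TASK_CAPS[task]
    return cap * calls * OUTPUT_PRICE_PER_MTOK / 1_000_000


for task in TASK_CAPS:
    bound = worst_case_output_spend(task, calls=100_000)
    print(f"{task}: at most ${bound:.2f} of output per 100k calls")
```

Actual spend will usually be lower, since only generated tokens are billed; the cap just guarantees it cannot be higher.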
- Size it to the expected output plus a small margin, not the model maximum.
- Watch for truncated responses (stop_reason: max_tokens). That is the signal your cap is too tight, so tune rather than guess.
- Pair a tight cap with an instruction telling the model to be brief, so it doesn't get cut off mid-sentence.
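One way to act on the truncation signal is a thin wrapper that checks stop_reason and retries once with a larger, still bounded, cap. This is a sketch under stated assumptions, not a definitive implementation: `create_with_cap` and `retry_cap` are hypothetical names, and `fake_create` is a stand-in so the example runs offline. In real use you would pass `client.messages.create` as the first argument.

```python
from types import SimpleNamespace


def create_with_cap(create, *, max_tokens, retry_cap=None, **kwargs):
    """Call create(...) with a tight cap; retry once with retry_cap if truncated.

    `create` is any callable shaped like client.messages.create.
    Returns (response, truncated), where truncated says whether the final
    response still stopped at its cap.
    """
    resp = create(max_tokens=max_tokens, **kwargs)
    if resp.stop_reason == "max_tokens" and retry_cap is not None:
        # The cap was too tight for this input: one retry with a larger cap.
        resp = create(max_tokens=retry_cap, **kwargs)
    return resp, resp.stop_reason == "max_tokens"


def fake_create(*, max_tokens, **kwargs):
    """Offline stand-in: reports truncation whenever the cap is below 8 tokens."""
    reason = "max_tokens" if max_tokens < 8 else "end_turn"
    return SimpleNamespace(stop_reason=reason)


resp, truncated = create_with_cap(
    fake_create,
    max_tokens=5,
    retry_cap=16,
    model="claude-sonnet-4-5",
    messages=[],
)
print(resp.stop_reason, truncated)
```

Logging how often the retry fires gives you the data to tune the first cap instead of guessing.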