The single biggest model-selection waste is using a flagship model as your default for everything. Classification, formatting, short rewrites, and tag extraction do not need a frontier model, and the price gap is large.
Consider Anthropic's lineup as a concrete example. Claude Opus is roughly $5 input / $25 output per million tokens, Sonnet is around $3 / $15, and Haiku is about $1 / $5. At those rates the same task costs about 5x more on Opus than on Haiku, for input and output tokens alike. The same tiering exists across providers (OpenAI's gpt-5 mini/nano tiers, Gemini Flash vs Pro).
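To make the gap concrete, here is a minimal sketch of the per-call arithmetic using the approximate prices quoted above. The helper name `cost_per_call` and the workload (400 input tokens, 5 output tokens) are illustrative assumptions, not measurements.

```python
# Approximate prices quoted in the text, in USD per million tokens (input, output).
PRICES_PER_MTOK = {
    "opus": (5.00, 25.00),
    "sonnet": (3.00, 15.00),
    "haiku": (1.00, 5.00),
}


def cost_per_call(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one call at the given tier (hypothetical helper)."""
    in_price, out_price = PRICES_PER_MTOK[tier]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


if __name__ == "__main__":
    # Assumed workload: a short classification prompt with a one-word answer.
    for tier in PRICES_PER_MTOK:
        per_call = cost_per_call(tier, input_tokens=400, output_tokens=5)
        print(f"{tier:<6} ${per_call * 1_000_000:,.0f} per million calls")
```

Under these assumptions the input tokens dominate the bill, and the totals come to about $2,125 per million calls on Opus versus about $425 on Haiku. Substitute your own token counts and current prices before drawing conclusions.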
Before (wasteful):
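import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Every call hits the flagship, including a yes/no check.
# `email` is assumed to hold the message text.
resp = client.messages.create(
    model="claude-opus-4-8",  # ~$5 / $25 per MTok
    max_tokens=5,
    messages=[{"role": "user",
               "content": f"Is this email spam? Reply yes or no.\n{email}"}],
)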
After (lean):
resp = client.messages.create(
    model="claude-haiku-4-5",  # ~$1 / $5 per MTok
    max_tokens=5,
    messages=[{"role": "user",
               "content": f"Is this email spam? Reply yes or no.\n{email}"}],
)
Why it saves: a binary classifier produces a handful of output tokens and needs little reasoning. On a task this simple the smaller model should return the same answer, and you pay roughly one-fifth the per-token rate. Latency typically drops too, since smaller models tend to respond faster.
The practical move: audit your call sites and bucket them as trivial (classify, extract, format), moderate (summarize, draft), and hard (multi-step reasoning, tricky code, ambiguous judgment). Route the first two buckets to a smaller tier (for example, trivial to Haiku and moderate to Sonnet) and reserve the flagship for the last. Validate quality on a sample before rolling out; if a downgraded route regresses, bump it back up. You are not chasing a universal cheap model; you are right-sizing per task.
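The bucketing and validation steps can be sketched roughly as follows. Everything here is illustrative: `ROUTES`, `pick_model`, `agreement_rate`, and `should_downgrade` are hypothetical names, the mid-tier entry is a placeholder to fill in yourself, and the 0.98 threshold is an arbitrary assumption.

```python
from typing import Callable, Sequence

# Hypothetical routing table. The Haiku and Opus IDs are the ones used in the
# examples above; the mid-tier entry is a placeholder, not a real model ID.
ROUTES = {
    "trivial": "claude-haiku-4-5",      # classify, extract, format
    "moderate": "YOUR_MID_TIER_MODEL",  # summarize, draft
    "hard": "claude-opus-4-8",          # multi-step reasoning, tricky code
}


def pick_model(bucket: str) -> str:
    """Return the model for a bucket; unknown buckets fall back to the flagship."""
    return ROUTES.get(bucket, ROUTES["hard"])


def agreement_rate(
    samples: Sequence[str],
    baseline: Callable[[str], str],
    candidate: Callable[[str], str],
) -> float:
    """Fraction of samples where the cheaper route matches the flagship's answer."""
    if not samples:
        raise ValueError("need at least one sample")
    matches = sum(
        baseline(s).strip().lower() == candidate(s).strip().lower() for s in samples
    )
    return matches / len(samples)


def should_downgrade(rate: float, threshold: float = 0.98) -> bool:
    """Keep the cheaper route only if agreement clears the assumed threshold."""
    return rate >= threshold
```

Here `baseline` and `candidate` stand for functions that wrap the flagship call and the cheaper call respectively. Exact-match agreement only suits short, closed answers such as yes/no labels; summaries and drafts would need a different comparison, such as a rubric or human review.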