Most people discover a prompt was huge only when the usage dashboard updates hours later. Flip that: measure first, then decide.
Before (blind sending):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    prompt = open("transcript.txt").read()  # who knows how big?
    response = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )

After (count, then decide):

    count = client.messages.count_tokens(
        model="claude-opus-4-8",  # count for the SAME model you'll run inference with
        messages=[{"role": "user", "content": prompt}],
    )
    print(count.input_tokens)  # e.g. 48210
    if count.input_tokens > 8000:
        # trim_or_summarize is your own helper, not part of the SDK
        prompt = trim_or_summarize(prompt)

Token counts are model-specific, so always pass the same model ID you'll use for the real call. For OpenAI models, tiktoken counts offline (enc = tiktoken.encoding_for_model("gpt-4o"); len(enc.encode(text))). Anthropic exposes a hosted count_tokens endpoint; Google's SDK has model.count_tokens().
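The trim_or_summarize call in the snippet above is left undefined, so here is one possible shape for it: a minimal sketch, not a recommended implementation. The name fit_to_ceiling, the keep-the-tail policy, and the 0.9 safety margin are all assumptions made for illustration. The counter is passed in as a plain callable so the same function works with a hosted endpoint, tiktoken, or a rough estimate.

```python
from typing import Callable


def fit_to_ceiling(
    prompt: str,
    ceiling: int,
    count_tokens: Callable[[str], int],
) -> str:
    """Drop text from the front of the prompt until it fits under the ceiling.

    Keeping the tail is an assumption that suits transcripts and logs,
    where the most recent content usually matters most.
    """
    tokens = count_tokens(prompt)
    while tokens > ceiling and prompt:
        # Scale the kept length by how far over the ceiling we are,
        # with a 10% margin so one pass is usually enough.
        keep_chars = int(len(prompt) * ceiling / tokens * 0.9)
        prompt = prompt[-keep_chars:] if keep_chars > 0 else ""
        tokens = count_tokens(prompt)  # one count per candidate prompt
    return prompt


if __name__ == "__main__":
    # Stand-in counter for the demo: one "token" per whitespace-separated word.
    word_counter = lambda text: len(text.split())
    long_prompt = "word " * 1000
    trimmed = fit_to_ceiling(long_prompt, ceiling=100, count_tokens=word_counter)
    print(word_counter(long_prompt), "->", word_counter(trimmed))
```

Because the cut is proportional to the overshoot, the loop normally converges in one or two counts rather than shaving a little off each time. A summarizing variant would replace the slicing step with a model call, which has its own token cost.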
Why it saves tokens: the API is stateless — your full input is re-sent and re-billed on every call. A retrieval step that silently grows from 2K to 50K tokens (someone pasted a whole log file) costs ~25x more on input, on every request, indefinitely. A pre-flight count lets you enforce a per-feature ceiling before the spend happens, not after.
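To make the 25x figure concrete, the arithmetic can be worked through as below. The price per million input tokens and the daily request volume are hypothetical placeholders, not any provider's real rates; only the 2K and 50K token sizes come from the text.

```python
def input_cost_usd(tokens_per_request: int, requests: int, usd_per_million_tokens: float) -> float:
    """Input-side cost only: every request re-sends and re-bills the full prompt."""
    return tokens_per_request * requests * usd_per_million_tokens / 1_000_000


PRICE_PER_MILLION = 3.00   # hypothetical price, USD per 1M input tokens
REQUESTS_PER_DAY = 10_000  # hypothetical traffic

small = input_cost_usd(2_000, REQUESTS_PER_DAY, PRICE_PER_MILLION)
large = input_cost_usd(50_000, REQUESTS_PER_DAY, PRICE_PER_MILLION)

print(f"2K-token prompt:  ${small:,.2f}/day")   # $60.00/day
print(f"50K-token prompt: ${large:,.2f}/day")   # $1,500.00/day
print(f"ratio: {large / small:.0f}x")           # 25x
```

The ratio is independent of the price and volume you plug in; those only set how quickly the difference becomes noticeable.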
Two honest caveats:
- A rough heuristic of "~4 characters per token" or "~0.75 words per token" is fine for English back-of-envelope math, but it drifts badly for code, JSON, non-Latin scripts, and emoji — use the real tokenizer when the number actually matters.
- count_tokens on a hosted API is a network round-trip (tens to a few hundred ms), so it is not free in latency. tiktoken is local and near-instant. Either way, count once per candidate prompt: don't call the hosted endpoint inside a tight loop; cache or batch it.
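
Both caveats can be sketched in a few lines. The names estimate_tokens and CachedCounter are hypothetical helpers, not part of any SDK; the estimator is the 4-characters-per-token rule of thumb, and the cache simply remembers counts by content hash so repeated text never triggers a second round-trip.

```python
import hashlib
from typing import Callable, Dict


def estimate_tokens(text: str) -> int:
    """Back-of-envelope estimate: roughly 4 characters per token (English prose only)."""
    return len(text) // 4


class CachedCounter:
    """Wraps any token-counting callable and remembers results by content hash."""

    def __init__(self, count_fn: Callable[[str], int]) -> None:
        self._count_fn = count_fn
        self._cache: Dict[str, int] = {}

    def count(self, text: str) -> int:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            # Only unseen text pays for the (possibly remote) count.
            self._cache[key] = self._count_fn(text)
        return self._cache[key]
```

To use it with a hosted endpoint, pass a small function that makes the real call and returns the integer count. If you count for more than one model, include the model ID in the cache key, since counts are model-specific.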
The payoff is a habit: when a single call would blow past your ceiling, you trim, summarize, or chunk before paying — instead of finding out next month.
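Of the three options, chunking is the least obvious to implement, so here is a rough sketch. It assumes paragraphs are separated by blank lines and that the counter is cheap to call (a local tokenizer or an estimate), since it counts each paragraph once. The function name chunk_by_paragraph is hypothetical.

```python
from typing import Callable, List


def chunk_by_paragraph(
    text: str,
    ceiling: int,
    count_tokens: Callable[[str], int],
) -> List[str]:
    """Greedily pack paragraphs into chunks whose summed counts stay under the ceiling."""
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for para in text.split("\n\n"):
        if not para.strip():
            continue  # skip empty paragraphs
        n = count_tokens(para)
        if current and current_tokens + n > ceiling:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Two limits worth knowing: a single paragraph larger than the ceiling still comes out as one oversized chunk, and summing per-paragraph counts only approximates the count of the joined text, so leave some headroom below the real limit.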