Tips library — TokenWise

⚙️Batching & Automation ~50% on input + output tokens

Move Every Non-Urgent Job to the Batch API and Pay Half Price

If a job doesn't need an answer in the next few seconds, send it through the Batch API instead of the live endpoint. The exact same request costs half as much.

Beginner 1 min Read →

💻Coding Assistants Often cuts output tokens 40-70% on edits to large files, varies by file size

Ask for the Patch, Not the Whole File

When editing an existing file, tell the assistant to return only the changed lines as a diff or snippet instead of regenerating the entire file.

Beginner 1 min Read →

🧠Context Management Images can be 1,000-2,000+ tokens each; removing stale ones cuts that per turn

Drop the Screenshot Once the Model Has Read It

Images, PDFs, and attachments are charged as tokens and re-sent every turn in a multimodal thread. After the model has described or transcribed one, you usually don't need to keep sending the pixels.

Beginner 2 min Read →

📊Measurement & Budgeting 10-30% on prompts you would have sent bloated

Count Tokens Before You Hit Send, Not After the Bill Arrives

Measure a prompt's token count before sending it, so you catch oversized context while trimming it is still free.

Beginner 1 min Read →

🎚️Model Selection 60-80% on routed traffic

Stop Paying Frontier Prices for Boilerplate Work

Most of your token spend is on tasks a small model handles perfectly. Match the model to the job instead of defaulting to your most expensive option for everything.

Beginner 1 min Read →

📐Output Control Shrinks classification and routing outputs substantially, frequently 5-15x fewer output tokens per call

Return IDs and Enums, Not Sentences

For classification, routing, and selection tasks, have the model emit a short code, ID, or enum value instead of a polite sentence. The downstream code only needs the token, not the prose around it.

Beginner 2 min Read →

♻️Prompt Caching & Reuse up to 90% on the cached portion

Freeze the Prefix: One Stray Timestamp Kills Your Whole Cache

Prompt caching is a prefix match. A single dynamic byte near the top of your prompt silently invalidates everything after it, so you pay full price every call without realizing it.

Beginner 1 min Read →

✍️Prompt Engineering 10-25% on short prompts

Stop Paying for 'Please' and 'I Was Wondering If'

Conversational filler and apologetic framing get tokenized and billed like any other text. Strip the social padding and lead with the instruction.

Beginner 1 min Read →

🔎Retrieval & RAG Eliminates the embedding API call and vector search on cache hits. The saved cost tracks your hit rate; it cuts embedding/retrieval spend, not the tokens sent to the LLM.

Cache the Context, Not Just the Answer

Cache the retrieved chunk set keyed by a normalized query, so popular or repeated questions skip the embedding call and vector search and reuse the same context block instead of rebuilding it every time.

Beginner 2 min Read →

⚙️Batching & Automation Often 40-70% fewer input tokens on bulk classification, varies with prompt size

Classify a Whole List in One Call, Not One Row at a Time

Send 20-50 items as a numbered list and get back a JSON array of labels, instead of paying for the same instruction prompt on every single row.

Beginner 1 min Read →

💻Coding Assistants often 30-60% on long sessions

Run /clear Between Tasks in Claude Code Instead of Letting Context Pile Up

Claude Code resends the whole conversation every turn. Finishing one task and starting an unrelated one in the same thread means you keep paying for stale tool output and dead files.

Beginner 1 min Read →

🧠Context Management 50-90% on file-heavy prompts

Paste the Function, Not the Whole File

Most coding questions need 20-40 lines, not your 800-line file. Send the relevant slice plus a one-line note about the rest, and your input shrinks dramatically without hurting the answer.

Beginner 2 min Read →

📊Measurement & Budgeting Prevents runaway-loop and leaked-key blowups; bounds worst-case spend rather than reducing normal usage

Set Hard Spend Caps in the Provider Console

Configure provider-side usage limits, budgets, and alerts so a bug, a retry storm, or a leaked key cannot quietly run your bill far past a ceiling you set in advance.

Beginner 3 min Read →

📐Output Control Often 30-60% fewer output tokens on short tasks

Strip the Preamble: Ask for the Answer Only

Chat models love to restate your question, add caveats, and offer follow-ups. On high-volume tasks those wrapper tokens dominate the bill. Tell the model to return only the payload.

Beginner 1 min Read →

♻️Prompt Caching & Reuse Cache reads run roughly 0.1x of base input price; the more users hit the same prefix, the closer your shared instructions get to free

Share One Cached System Prompt Across All Your Users

A single per-user byte (name, ID, locale) in the system prompt forks the cache into one entry per user. Strip personalization out of the prefix so every user reads the same cached block.

Beginner 1 min Read →

✍️Prompt Engineering Roughly 20-40% fewer follow-up turns on formatting-sensitive tasks, in our experience

Fence the Output Before It Wanders

State the constraints that usually trigger a do-over up front, so you don't pay for a second generation just to strip the preamble.

Beginner 2 min Read →

🔎Retrieval & RAG 70-95% on document-heavy prompts

Stop Pasting Whole Documents: Retrieve the 3 Chunks That Actually Answer the Question

Dumping a full PDF or knowledge base into every prompt bills you for thousands of tokens the model never needed. Retrieve only the passages relevant to the question instead.

Beginner 2 min Read →

⚙️Batching & Automation Varies; commonly 20-60% on duplicate-heavy workloads

Deduplicate and Cache Identical Requests Before They Ever Hit the API

Real-world batches are full of repeats. Hash each request, send each unique prompt once, and fan the answer back out to every duplicate.

Beginner 1 min Read →

💻Coding Assistants 10-30% on context-heavy chats

Add a .cursorignore So Cursor Stops Indexing Your node_modules

Cursor's @codebase and automatic context can pull in build artifacts, lockfiles, and vendored dependencies. A .cursorignore file keeps that noise out of every prompt.

Beginner 1 min Read →

🧠Context Management Trims re-sent history; often 20-60% fewer input tokens per turn after a topic switch

Start a New Chat When the Topic Changes

Chat apps re-send your whole conversation with every message. When you switch tasks, the old turns become dead weight you keep paying to re-transmit — even with caching discounts.

Beginner 2 min Read →

📐Output Control Caps runaway costs; output tokens are typically 3-5x the input price

Set max_tokens as a Hard Cost Ceiling, Not an Afterthought

Output tokens are the expensive half of most API bills. Setting an explicit max_tokens on every API call turns an open-ended cost into a known maximum.

Beginner 1 min Read →

📐Output Control Trims tabular output noticeably, commonly 15-40% fewer tokens versus a Markdown table

Emit CSV, Not Markdown Tables

When the model returns rows of data your code will parse, ask for CSV instead of a Markdown table. The pipes, padding spaces, and separator row in Markdown are tokens that carry no data.

Beginner 2 min Read →

⚙️Batching & Automation 🔒 Pro

Use Targeted Retries with Backoff Instead of Blindly Re-Sending

Distinguish retryable errors from real failures, back off on rate limits, and resend only the failed items so you stop paying for accidental duplicate generations.

Intermediate 2 min Unlock →

💻Coding Assistants 🔒 Pro

Scope Copilot Chat With #file Instead of @workspace

@workspace tells Copilot to search your entire repo and stuff retrieved snippets into the prompt. For targeted work, naming specific files with #file is leaner and usually more accurate.

Intermediate 1 min Unlock →

🧠Context Management 🔒 Pro

Prune Bulky Tool Results Once You've Used Them

Tool and function-call outputs are the heaviest, most disposable thing in an agent transcript. Once the model has extracted what it needs, replace the raw result with a one-line stub before the next turn.

Intermediate 2 min Unlock →

📊Measurement & Budgeting 🔒 Pro

Log input_tokens and output_tokens on Every Call to Find Your Real Waste

Persist the usage object from every API response with a feature tag, so you can see exactly which feature and which token type is draining your budget.

Intermediate 1 min Unlock →

🎚️Model Selection 🔒 Pro

Cascade: Try the Cheap Model First, Escalate Only When It Fails

Send every request to a small model first, programmatically check the answer, and only escalate to a frontier model when the cheap one falls short.

Intermediate 2 min Unlock →

📐Output Control 🔒 Pro

Use Stop Sequences to Cut Generation the Instant You Have Enough

A stop sequence halts generation the moment a chosen string appears — you stop paying for output the instant your data is complete, no truncation guesswork required.

Intermediate 1 min Unlock →

♻️Prompt Caching & Reuse 🔒 Pro

Cache Your Tool Definitions, Not Just the System Prompt

Tool schemas render before the system prompt, so a non-deterministic tool list silently blocks the cache for everything after it. Sort and freeze the tool array to make tools cacheable.

Intermediate 1 min Unlock →

✍️Prompt Engineering 🔒 Pro

Ask for the Diff, Not the Director's Cut

When revising a long artifact, request only the changed lines as a patch instead of having the model reprint the whole thing.

Intermediate 2 min Unlock →

🔎Retrieval & RAG 🔒 Pro

Chunk on Structure, Not Character Count, So You Retrieve Fewer (and Smaller) Chunks

Naive fixed-length chunking splits ideas mid-sentence, forcing you to retrieve more chunks (and more overlap) to capture one answer. Chunk on semantic boundaries to send fewer tokens per query.

Intermediate 2 min Unlock →

⚙️Batching & Automation 🔒 Pro

Stack Prompt Caching on Top of Your Batch Jobs for Compounding Savings

Batch requests support prompt caching. When every request shares a big instruction block or document, cache it once and the per-request cost collapses.

Intermediate 1 min Unlock →

💻Coding Assistants 🔒 Pro

Use Inline Completion for Trivial Edits, Save Chat for Reasoning

Route boilerplate and one-liners through tab-style inline completion instead of opening a chat panel, which drags in your whole conversation and attached files.

Intermediate 2 min Unlock →

🧠Context Management 🔒 Pro

Compress Long Threads with a Rolling Summary

Instead of dragging a 40-turn thread forward, periodically have the model write a compact state summary, then continue from that.

Intermediate 2 min Unlock →

📊Measurement & Budgeting 🔒 Pro

Give Each Feature a Token Budget and Enforce It with max_tokens

Set an explicit per-feature output ceiling instead of leaving max_tokens at a huge default, and reserve big budgets only for features that truly need them.

Intermediate 1 min Unlock →

🎚️Model Selection 🔒 Pro

Don't Burn Reasoning Tokens on Tasks That Don't Reason

Model selection isn't just which model — it's which reasoning mode. Turn thinking down or off for straightforward work and reserve deep reasoning for genuinely hard problems.

Intermediate 2 min Unlock →

📐Output Control 🔒 Pro

Abort the Stream the Moment You Have Enough

When streaming, close the connection as soon as the part you care about arrives instead of letting the model run to its natural stop. You only pay for tokens actually generated before the abort.

Intermediate 1 min Unlock →

♻️Prompt Caching & Reuse 🔒 Pro

Know Your Minimum: Short Prompts Silently Refuse to Cache

Below a model-specific token floor, a cache_control marker does nothing — no error, just a full-price bill. Know the floor before you rely on caching.

Intermediate 1 min Unlock →

✍️Prompt Engineering 🔒 Pro

Two Sharp Examples Beat Eight Bloated Ones

Few-shot examples are usually the heaviest part of a prompt. Trim each one to the minimum that demonstrates the pattern, and use the fewest that hold accuracy.

Intermediate 2 min Unlock →

🔎Retrieval & RAG 🔒 Pro

Filter by Metadata Before You Search

Attach structured metadata to chunks and apply WHERE-style filters before the vector search runs, so you embed and rank a smaller candidate set and stuff fewer off-topic chunks into the prompt.

Intermediate 2 min Unlock →

📊Measurement & Budgeting 🔒 Pro

Tag Every Call So You Know Which Feature Burns Tokens

Attach feature, user, and environment labels to every API call so your bill breaks down by what actually drives cost instead of one undifferentiated total.

Intermediate 1 min Unlock →

♻️Prompt Caching & Reuse 🔒 Pro

Order Your Prompt by Volatility: Tools, then System, then the Question

The model renders tools, then system, then messages. Put your most stable content first and your most volatile content last, or your breakpoints cache nothing reusable.

Intermediate 1 min Unlock →

✍️Prompt Engineering 🔒 Pro

Reference Your Data, Don't Re-Paste It Every Turn

In chat UIs and stateless APIs, re-pasting the same document or spec into every message silently multiplies your input cost. Send it once and refer back.

Intermediate 2 min Unlock →

🔎Retrieval & RAG 🔒 Pro

Add a Reranker and a Hard Token Budget: Retrieve 20 Candidates, Send Only the Best 3

Vector similarity is approximate, so people inflate top-k to avoid missing the answer. A cheap reranking pass lets you fetch many candidates but send only the few that matter to the expensive LLM.

Intermediate 2 min Unlock →

⚙️Batching & Automation 🔒 Pro

Run an Async Queue with a Concurrency Cap Instead of Firing All at Once

Push jobs through a bounded worker pool so you saturate your rate limit without tripping it, eliminating the retry storms and tier upgrades that quietly inflate cost.

Advanced 2 min Unlock →

💻Coding Assistants 🔒 Pro

Turn Off Auto Codebase Context When You Don't Need It

Features like 'codebase' auto-retrieval and automatic open-file context silently attach extra tokens to every message. Switch them off for narrow tasks and attach context explicitly.

Advanced 2 min Unlock →

🧠Context Management 🔒 Pro

Keep Working State in a File, Not in the Conversation

On long tasks, let the model write its plan, findings, and decisions to an external scratchpad and re-read only the slice it needs, instead of accumulating all of it as ever-growing conversation history.

Advanced 2 min Unlock →

📊Measurement & Budgeting 🔒 Pro

Measure Cost-Per-Success, Not Cost-Per-Call

Compare two prompts or models on cost divided by successful outcomes, including retries and rework, so you stop chasing cheap calls that quietly fail and get redone.

Advanced 2 min Unlock →

🎚️Model Selection 🔒 Pro

Use a Big Model as the Planner, Small Models as the Workers

In agentic and multi-step pipelines, reserve the frontier model for orchestration and hard reasoning, and delegate bulk subtasks (search, read, extract) to a cheaper model.

Advanced 2 min Unlock →

📐Output Control 🔒 Pro

Design a Compact Output Schema (and Skip the Pretty-Printing)

When you need structured data, the shape you ask for directly determines token count. Short keys, no markdown scaffolding, and minified output cut tokens on every response — and the input echo if you loop.

Advanced 1 min Unlock →

♻️Prompt Caching & Reuse 🔒 Pro

Match TTL to Traffic, and Pre-Warm Before the First User Hits

The default 5-minute cache evaporates between bursts. Choose 5-minute vs 1-hour TTL by your traffic gaps, and pre-warm at startup to kill first-request latency.

Advanced 1 min Unlock →

✍️Prompt Engineering 🔒 Pro

Factor Your System Prompt and Cap the Output

Move stable rules into a reusable, cacheable system prompt once, and constrain the response so the model can't ramble — output tokens usually cost more per token than input.

Advanced 2 min Unlock →

🔎Retrieval & RAG 🔒 Pro

Embed and Summarize Once: Stop Re-Tokenizing the Same Documents on Every Query

Re-embedding unchanged documents and re-summarizing the same sources on every run quietly burns tokens. Compute these artifacts once, persist them, and reuse provider-side prompt caching for stable context.

Advanced 2 min Unlock →

⚙️Batching & Automation 🔒 Pro

Collapse Many Tiny Calls into One Structured Request

Ten one-item calls re-send your instructions ten times. Batch the items into a single request with a structured-output schema and pay the overhead once.

Advanced 2 min Unlock →

💻Coding Assistants 🔒 Pro

Pipe Logs Through head/grep Before Pasting Them Into an Assistant

A 4,000-line stack trace or verbose build log is mostly repetition the model doesn't need. Extract the signal first; pasting the whole thing is the single most wasteful coding-assistant habit.

Advanced 2 min Unlock →

🧠Context Management 🔒 Pro

Build a Sliding Window with a Cached, Stable Prefix

For API apps, cap history at the last N turns and put unchanging instructions first so prompt caching can discount the prefix.

Advanced 2 min Unlock →

📊Measurement & Budgeting 🔒 Pro

Wire Up Spend Alerts and a Token Circuit Breaker Before You Need Them

Combine provider budget alerts with an in-app token meter that hard-stops a feature once it blows past its expected per-period budget.

Advanced 2 min Unlock →

♻️Prompt Caching & Reuse 🔒 Pro

Read the Usage Block to Prove Your Cache Actually Hits

A cache_control marker that silently never hits looks identical to one that works — until you read the three usage token fields and compute your real hit rate.

Advanced 1 min Unlock →

✍️Prompt Engineering 🔒 Pro

Try Zero-Shot Before You Pay for Examples

Examples are recurring input tokens on every call — test whether a crisp instruction does the job before you attach them by default.

Advanced 2 min Unlock →

🔎Retrieval & RAG 🔒 Pro

Go Hybrid to Retrieve Less

Combine a keyword (BM25) score with the vector score so exact-term matches rank first, letting you lower top-k because the right chunk lands near the top instead of being padded around with semantic near-misses.

Advanced 2 min Unlock →

60 ways to spend fewer tokens

Move Every Non-Urgent Job to the Batch API and Pay Half Price

Ask for the Patch, Not the Whole File

Drop the Screenshot Once the Model Has Read It

Count Tokens Before You Hit Send, Not After the Bill Arrives

Stop Paying Frontier Prices for Boilerplate Work

Return IDs and Enums, Not Sentences

Freeze the Prefix: One Stray Timestamp Kills Your Whole Cache

Stop Paying for 'Please' and 'I Was Wondering If'

Cache the Context, Not Just the Answer

Classify a Whole List in One Call, Not One Row at a Time

Run /clear Between Tasks in Claude Code Instead of Letting Context Pile Up

Paste the Function, Not the Whole File

Set Hard Spend Caps in the Provider Console

Strip the Preamble: Ask for the Answer Only

Share One Cached System Prompt Across All Your Users

Fence the Output Before It Wanders

Stop Pasting Whole Documents: Retrieve the 3 Chunks That Actually Answer the Question

Deduplicate and Cache Identical Requests Before They Ever Hit the API

Add a .cursorignore So Cursor Stops Indexing Your node_modules

Start a New Chat When the Topic Changes

Set max_tokens as a Hard Cost Ceiling, Not an Afterthought

Emit CSV, Not Markdown Tables

Use Targeted Retries with Backoff Instead of Blindly Re-Sending

Scope Copilot Chat With #file Instead of @workspace

Prune Bulky Tool Results Once You've Used Them

Log input_tokens and output_tokens on Every Call to Find Your Real Waste

Cascade: Try the Cheap Model First, Escalate Only When It Fails

Use Stop Sequences to Cut Generation the Instant You Have Enough

Cache Your Tool Definitions, Not Just the System Prompt

Ask for the Diff, Not the Director's Cut

Chunk on Structure, Not Character Count, So You Retrieve Fewer (and Smaller) Chunks

Stack Prompt Caching on Top of Your Batch Jobs for Compounding Savings

Use Inline Completion for Trivial Edits, Save Chat for Reasoning

Compress Long Threads with a Rolling Summary

Give Each Feature a Token Budget and Enforce It with max_tokens

Don't Burn Reasoning Tokens on Tasks That Don't Reason

Abort the Stream the Moment You Have Enough

Know Your Minimum: Short Prompts Silently Refuse to Cache

Two Sharp Examples Beat Eight Bloated Ones

Filter by Metadata Before You Search

Tag Every Call So You Know Which Feature Burns Tokens

Order Your Prompt by Volatility: Tools, then System, then the Question

Reference Your Data, Don't Re-Paste It Every Turn

Add a Reranker and a Hard Token Budget: Retrieve 20 Candidates, Send Only the Best 3

Run an Async Queue with a Concurrency Cap Instead of Firing All at Once

Turn Off Auto Codebase Context When You Don't Need It

Keep Working State in a File, Not in the Conversation

Measure Cost-Per-Success, Not Cost-Per-Call

Use a Big Model as the Planner, Small Models as the Workers

Design a Compact Output Schema (and Skip the Pretty-Printing)

Match TTL to Traffic, and Pre-Warm Before the First User Hits

Factor Your System Prompt and Cap the Output

Embed and Summarize Once: Stop Re-Tokenizing the Same Documents on Every Query

Collapse Many Tiny Calls into One Structured Request

Pipe Logs Through head/grep Before Pasting Them Into an Assistant

Build a Sliding Window with a Cached, Stable Prefix

Wire Up Spend Alerts and a Token Circuit Breaker Before You Need Them

Read the Usage Block to Prove Your Cache Actually Hits

Try Zero-Shot Before You Pay for Examples

Go Hybrid to Retrieve Less

Like what you see?