Multimodal chats hide a recurring cost: every image and attachment in the thread is converted to tokens and re-sent on every follow-up turn, the same way text is. A single screenshot commonly lands in the high hundreds to a couple thousand input tokens depending on its resolution, and a multi-page PDF can be far more. If you paste a screenshot, get an answer, and keep chatting, you re-pay for those pixels on turn after turn even though you're now only discussing text.
Before (wasteful):
Turn 1: [uploads a 1,600-token dashboard screenshot] "What's the error in this chart?"
Turn 2: "Okay, how do I fix that query?" (screenshot re-sent)
Turn 3: "Write a test for it." (screenshot re-sent again)
Three turns, three full charges for an image you only needed once.
After (lean):
Turn 1: [uploads screenshot] "Transcribe the error text and the failing query verbatim, then we'll continue."
Turn 2+: Continue in a fresh message (or new chat) using the transcribed text. No image re-sent.
You convert the visual into a few dozen tokens of text once, then work from the text.
Why it works: Vision models tokenize images by area or tile count, so a high-res attachment is genuinely expensive, and a stateless API re-bills the whole conversation each turn. Once the salient content is in text form, the pixels add cost without adding information. Removing them shrinks every subsequent request.
Practical moves:
- Ask the model to extract what matters (error text, table values, layout notes) on the first turn, then drop the image.
- In chat UIs that keep attachments inline, start a new chat once you're past the visual step and paste the extracted text.
- Downscale before uploading when fine detail isn't needed; a 4K screenshot of a few lines of red text wastes most of its tokens.
- Keep the image only while you're genuinely still asking about visual specifics (exact colors, spatial layout).
The rule of thumb: an image is an expensive input you should consume once and convert to cheap text, not a sticky attachment you drag through the whole conversation.