Prompt caches aren't private to one conversation — multiple users hitting the same byte-identical prefix all read the same cache entry. That means a large shared instruction block can be written once and read cheaply by everyone. But one personalized byte in the system prompt quietly destroys this: each user gets a distinct prefix, so each pays the full write cost and nobody shares.
Before
system = [{
    "type": "text",
    "text": f"You are Acme's support agent.\nUser: {user.name} ({user.id})\n\n{POLICY_DOCS}",
    "cache_control": {"type": "ephemeral"},
}]
{user.name} and {user.id} sit ahead of POLICY_DOCS in the prefix, so the 8 KB of policy text after them is cached separately for each user. With 5,000 users, that is 5,000 separate cache writes of the same policy text and no cross-user reads.
After
system = [{
    "type": "text",
    "text": f"You are Acme's support agent.\n\n{POLICY_DOCS}",  # identical for everyone
    "cache_control": {"type": "ephemeral"},
}]
messages = [
    {"role": "user", "content": f"[context] Acting for {user.name} (id {user.id})."},
    {"role": "user", "content": user_question},
]
The personalization moves into the message turns — after the cached prefix — so it invalidates nothing ahead of it.
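As a minimal sketch of this layout, the helper below builds the request arguments from a frozen system block plus per-user messages, then checks that the portion ahead of the breakpoint is byte-identical across users. `User`, `build_request`, `cached_prefix_bytes`, and the placeholder `POLICY_DOCS` text are hypothetical names introduced for illustration, and the JSON serialization is only a stand-in for however the provider actually derives its cache key.

```python
import json
from dataclasses import dataclass

# Placeholder policy text; in practice this would be the real shared document.
POLICY_DOCS = "Refunds are available within 30 days of purchase.\n" * 50

# Built once at import time so every request reuses exactly the same bytes.
SHARED_SYSTEM = [{
    "type": "text",
    "text": f"You are Acme's support agent.\n\n{POLICY_DOCS}",
    "cache_control": {"type": "ephemeral"},
}]


@dataclass(frozen=True)
class User:
    id: str
    name: str


def build_request(user: User, question: str) -> dict:
    """Return request kwargs: shared system prefix, per-user message turns."""
    return {
        "system": SHARED_SYSTEM,
        "messages": [
            {"role": "user", "content": f"[context] Acting for {user.name} (id {user.id})."},
            {"role": "user", "content": question},
        ],
    }


def cached_prefix_bytes(request: dict) -> bytes:
    """Serialize the part of the request that precedes the cache breakpoint."""
    return json.dumps(request["system"], sort_keys=True).encode("utf-8")


alice = build_request(User("u1", "Alice"), "Where is my order?")
bob = build_request(User("u2", "Bob"), "Can I get a refund?")

# The system blocks match byte for byte; only the message turns differ.
print(cached_prefix_bytes(alice) == cached_prefix_bytes(bob))
```

Defining the system block once as a module-level constant, rather than rebuilding it per request, makes it harder for a stray timestamp or username to slip into the shared prefix.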
Why it works
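The cache key is derived from the exact bytes up to each breakpoint, and entries are scoped per model, not per user or per conversation (sharing still stops at the account or organization boundary). Keep the system prompt frozen and identical, and the first user to arrive pays the ~1.25x write while every subsequent user who arrives before the entry expires pays the ~0.1x read for that shared block. Ephemeral entries are short-lived and are refreshed each time they are read, so steady traffic keeps the shared entry warm, while a quiet period means the next user pays the write again. Per-user facts belong in messages (or a role: "system" message where supported), which sit after the prefix and only affect bytes from that point onward.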
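The savings can be sketched with some rough arithmetic, using the ~1.25x write and ~0.1x read multipliers quoted above. The figures assume, purely for illustration, that the policy block is 2,000 tokens, that each user sends one request, and that every request arrives while the shared entry is still alive. `relative_cost` is a hypothetical helper, and costs are expressed in units of the base input-token price rather than in currency.

```python
WRITE_MULTIPLIER = 1.25  # cache write, relative to the base input-token price
READ_MULTIPLIER = 0.10   # cache read, relative to the base input-token price


def relative_cost(users: int, block_tokens: int, shared: bool) -> float:
    """Cost of the cached block in base-input-token units, one request per user."""
    if users <= 0:
        return 0.0
    if shared:
        # One write by the first user, then a read for everyone else.
        return block_tokens * (WRITE_MULTIPLIER + (users - 1) * READ_MULTIPLIER)
    # Personalized prefix: every user writes their own copy.
    return block_tokens * users * WRITE_MULTIPLIER


per_user = relative_cost(5_000, 2_000, shared=False)
shared = relative_cost(5_000, 2_000, shared=True)

print(f"per-user prefix: {per_user:,.0f}")
print(f"shared prefix:   {shared:,.0f}")
print(f"ratio:           {per_user / shared:.1f}x")
```

Under these assumptions the personalized prefix costs roughly twelve times as much for the policy block alone. The gap narrows if traffic is sparse enough that the shared entry expires between requests.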
Confirm sharing works by watching a second user's first request, sent while the entry written by the first user is still alive: cache_read_input_tokens should already cover POLICY_DOCS even though that user never sent a prior message.
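A small check along these lines could automate that confirmation. It is a sketch that takes the usage counters as a plain mapping. `cache_read_input_tokens` is the counter named above, while the `cache_creation_input_tokens` key, the helper name `shared_cache_hit`, and the `min_shared_tokens` threshold are assumptions to adapt to whatever response object is in use. The sample counters are invented for illustration, not measurements.

```python
from typing import Mapping


def shared_cache_hit(usage: Mapping[str, int], min_shared_tokens: int) -> bool:
    """True if the request read at least min_shared_tokens from an existing cache entry."""
    read = usage.get("cache_read_input_tokens", 0) or 0
    return read >= min_shared_tokens


# Illustrative counters: the first user writes the entry, the second reads it.
first_user = {"cache_creation_input_tokens": 2000, "cache_read_input_tokens": 0, "input_tokens": 40}
second_user = {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 2000, "input_tokens": 35}

print(shared_cache_hit(first_user, min_shared_tokens=1500))
print(shared_cache_hit(second_user, min_shared_tokens=1500))
```

Setting the threshold slightly below the expected size of the shared block tolerates small tokenization differences while still failing loudly if a personalized byte creeps back into the prefix.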