Many models support prompt caching, which is highly advantageous for lengthy system prompts, repeated references to the same video, or any other token-intensive operations.
Is it feasible to implement a prompt caching system within callin.io?
Information on your callin.io setup
- callin.io version: 1.81.4
- Database (default: SQLite): SQLite
- callin.io EXECUTIONS_PROCESS setting (default: own, main): Own
- Running callin.io via (Docker, npm, callin.io cloud, desktop app): Google Cloud
- Operating system: Windows 10
When utilizing OpenAI, prompt caching occurs automatically.
Within callin.io, when you set up a prompt, it's transmitted with every execution. There isn't an integrated way to "cache" it independently to conserve tokens.
This is fundamental to how LLM APIs operate: each request requires the complete context (system, user, and assistant messages) to produce a response.
However, OpenAI already manages token savings internally when you submit similar requests repeatedly.
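For what it's worth, you can verify whether OpenAI's automatic caching kicked in by inspecting the usage details returned with each response. Here is a minimal sketch against the REST API (automatic caching only applies once the reusable prompt prefix is long enough, roughly 1,024 tokens, so treat the cached_tokens field as optional):

```python
# Minimal sketch: check how many prompt tokens OpenAI reports as served from cache.
# Assumes OPENAI_API_KEY is set; "prompt_tokens_details" may be absent, so it is
# read defensively rather than assumed.
import os
import requests

LONG_SYSTEM_PROMPT = "You are a helpful assistant for video transcripts. " * 100

resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-4o-mini",
        "messages": [
            {"role": "system", "content": LONG_SYSTEM_PROMPT},
            {"role": "user", "content": "Summarize the last transcript I sent."},
        ],
    },
    timeout=60,
)
usage = resp.json().get("usage", {})
cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
print(f"prompt tokens: {usage.get('prompt_tokens')}, served from cache: {cached}")
```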
You can find more details in this documentation:
If this response addresses your question, please consider marking it as a solution.
I reached out to OpenAI's developers, and they informed me differently.
What information did they provide? That the official documentation is out of date and they have discontinued prompt caching?
What if we're using Google Gemini? We'd need a new configuration option in the Gemini model node to specify the cache_name. The request structure looks like this:
{
  "contents": [
    {
      "parts": [
        {
          "text": "Please summarize this transcript"
        }
      ],
      "role": "user"
    }
  ],
  "cachedContent": "'$CACHE_NAME'"
}
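For reference, here is a rough end-to-end sketch of explicit context caching against the Gemini REST API directly, which is what a cache_name option in the node would have to wrap. It assumes a GOOGLE_API_KEY, the v1beta cachedContents endpoint, and a versioned model name; explicit caching also requires the cached content to exceed a minimum token count, so verify the details against Google's current context-caching docs:

```python
# Rough sketch of explicit Gemini context caching via the REST API (v1beta).
# Assumes GOOGLE_API_KEY is set; endpoint and field names follow Google's
# context-caching docs and may change, so double-check before relying on this.
import os
import requests

API_KEY = os.environ["GOOGLE_API_KEY"]
BASE = "https://generativelanguage.googleapis.com/v1beta"
MODEL = "models/gemini-1.5-flash-001"  # explicit caching needs a versioned model

# 1) Create a cache holding the large, reusable context (e.g. a long transcript).
cache_resp = requests.post(
    f"{BASE}/cachedContents?key={API_KEY}",
    json={
        "model": MODEL,
        "contents": [
            {"role": "user", "parts": [{"text": open("transcript.txt").read()}]}
        ],
        "ttl": "600s",  # keep the cache alive for 10 minutes
    },
    timeout=60,
)
cache_name = cache_resp.json()["name"]  # e.g. "cachedContents/abc123"

# 2) Reference the cache by name instead of resending the transcript each time.
gen_resp = requests.post(
    f"{BASE}/{MODEL}:generateContent?key={API_KEY}",
    json={
        "contents": [
            {"role": "user", "parts": [{"text": "Please summarize this transcript"}]}
        ],
        "cachedContent": cache_name,
    },
    timeout=60,
)
print(gen_resp.json()["candidates"][0]["content"]["parts"][0]["text"])
```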
I'm looking into this as well.
Here's what the OpenRouter documentation states regarding caching:
Gemini isn't mentioned in the documentation. However, today I experimented with the same chat pipeline using Sonnet 3.7 and Gemini 2.5 Pro through OpenRouter. For comparable requests in the middle of a chat (7k input, 500 output tokens), the cost with Sonnet was 43 times higher than with Gemini.
Based on the model prices (roughly $1 per 1M input / $10 per 1M output tokens for Gemini, and $3 per 1M input / $15 per 1M output tokens for Sonnet), the difference should be around 3x, not 43x.
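As a rough sanity check of that estimate (treating the per-1M prices above as approximate):

```python
# Rough sanity check of the expected cost ratio for a 7k-input / 500-output request,
# using the approximate per-1M token prices quoted above.
IN_TOKENS, OUT_TOKENS = 7_000, 500

def cost(in_price_per_m, out_price_per_m):
    return IN_TOKENS / 1e6 * in_price_per_m + OUT_TOKENS / 1e6 * out_price_per_m

gemini = cost(1, 10)   # ≈ $0.0120
sonnet = cost(3, 15)   # ≈ $0.0285
print(f"Sonnet/Gemini cost ratio: {sonnet / gemini:.1f}x")  # ≈ 2.4x
```

Either way the expected gap is in the 2-3x range, nowhere near 43x.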
I also observed requests for Gemini where the costs were approximately 10x lower than Sonnet (at the beginning of a dialogue).
Therefore, it's highly likely that automatic caching is being applied for Gemini when using the OpenRouter Model node.
My system prompt alone is 4k tokens.
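One way to confirm is to look up the per-request stats that OpenRouter records for each generation ID. Here is a sketch assuming the /api/v1/generation endpoint and an OPENROUTER_API_KEY; the generation ID placeholder and the exact field names are assumptions, so just inspect whatever JSON comes back for cost and cached-token fields:

```python
# Sketch: look up OpenRouter's per-generation stats to see whether caching
# reduced the billed cost. Assumes the /api/v1/generation endpoint and an
# OPENROUTER_API_KEY environment variable.
import os
import requests

GENERATION_ID = "gen-..."  # hypothetical: the id returned in the chat completion response

resp = requests.get(
    "https://openrouter.ai/api/v1/generation",
    params={"id": GENERATION_ID},
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    timeout=30,
)
stats = resp.json().get("data", {})
# Look for fields such as the total cost, native token counts, or a cache discount.
for key, value in stats.items():
    print(f"{key}: {value}")
```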
I recently discovered that OpenRouter was utilizing my Google AI Studio account key. Consequently, the $300 in trial credits I had there were depleted, with the primary cost being charged to my Google AI Studio account.
After experimenting with API keys and their fallback mechanisms within OpenRouter, I now observe the costs directly in OpenRouter, as it primarily uses Google Vertex.
Do the chat model nodes already include prompt caching?
The question seems to be about how to enable explicit context caching to ensure cost savings:
I completely agree. I'm hoping that explicit caching will be enabled for the Gemini model node.