this is exactly the problem we keep running into. the cost isn't just "how many tokens did this call use," it's "how many tokens did this entire user action consume across all the agent loops, retries, tool calls, and embeddings."
most observability tools show you the LLM call as one flat span. you can see it cost X tokens but you can't correlate it with the API request that triggered it, or see that the agent looped 4 times because the first 3 outputs failed validation. so you end up building custom logging and hoping the numbers add up.
we've been building an APM (immersivefusion.com) where cost is a first-class dimension on every trace. so you can see one request flow from the UI through your backend through the agent workflow, and each span carries its token cost. the idea is you should be able to answer "what does a checkout cost when the recommendation agent is in the loop" without stitching together 3 different tools.
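to make "each span carries its token cost" concrete, here's a minimal sketch in plain Python with no APM dependency. the `Span`/`Trace` names and the token numbers are illustrative, not our actual API:

```python
import uuid
from dataclasses import dataclass, field

# illustrative sketch only: a trace whose spans each carry token cost,
# so one user action can be summed across agent loops and retries.
# Span/Trace are hypothetical names, not a real APM API.

@dataclass
class Span:
    name: str
    prompt_tokens: int = 0
    completion_tokens: int = 0
    children: list = field(default_factory=list)

    def total_tokens(self) -> int:
        # this span's own cost plus everything it triggered downstream
        own = self.prompt_tokens + self.completion_tokens
        return own + sum(child.total_tokens() for child in self.children)

@dataclass
class Trace:
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    root: Span = field(default_factory=lambda: Span("checkout_request"))

# one checkout request: the agent loops once because the first output failed
trace = Trace()
agent = Span("recommendation_agent")
trace.root.children.append(agent)
agent.children.append(Span("llm_call_1", prompt_tokens=900, completion_tokens=300))
agent.children.append(Span("llm_call_retry", prompt_tokens=950, completion_tokens=280))

print(trace.root.total_tokens())  # 2430
```

the point is just that cost rolls up the tree the same way latency does, so "what did this request cost" is one query instead of log archaeology.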
for the forecasting question specifically, i think the answer is you need a few weeks of production data with good instrumentation, and then you can build a distribution. the variance is real but it's not random; it's usually a few specific flows that blow up (retries on bad structured output like @hkonte mentioned, or RAG queries that hit the wrong chunk size). once you can see which flows are expensive, the guardrails become obvious.
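as a toy example of building that distribution, with invented numbers rather than real data:

```python
import statistics
from collections import defaultdict

# invented per-request token totals grouped by flow; in production these
# would come from a few weeks of instrumented traces
samples = [
    ("checkout", 2400), ("checkout", 2600), ("checkout", 9800),  # retry blowup
    ("search", 700), ("search", 750), ("search", 720),
]

costs_by_flow = defaultdict(list)
for flow, tokens in samples:
    costs_by_flow[flow].append(tokens)

for flow, costs in sorted(costs_by_flow.items()):
    print(f"{flow}: p50={statistics.median(costs)} max={max(costs)}")
```

even with a handful of samples the shape jumps out: checkout's median is fine but its tail is 4x the median, which tells you exactly where to put the retry cap.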
also wrote a longer piece on this if anyone's interested: immersivefusion.com/blog/end-to-end-observability-from-ui-to-ai-agent-to-invoice
Agreed. The real cost unit becomes the whole agent workflow, not a single LLM call. One user action can trigger dozens of calls.
We ran into the same issue and ended up building https://oxlo.ai to make the cost side more predictable for agent workloads.
I love the idea. @Edgee.ai we are tracking cost in real time by tag, LLM, ... but not forecasting yet, and that would indeed be very useful. Something to explore; thanks for the feedback.
That’s great. Real-time tracking is a big step already. The tricky part we kept running into was the variance itself, especially with retries and agent loops. That’s partly why we started experimenting with Oxlo.ai (https://oxlo.ai) where the pricing model absorbs that variance so builders don’t have to constantly model token risk.
local models are better for controlling costs; commercial models are expensive and you have no control over that cost. however, the local model training setup needs to be architected very well so it can keep training continuously.
Local models help remove token cost uncertainty, but they shift the problem to infrastructure and ops. GPUs, scaling, maintenance, and latency can add up quickly depending on the workload. For many builders it ends up being a tradeoff between predictable infra cost and flexible API usage.
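As a rough illustration of that tradeoff, here is a back-of-envelope break-even calculation. Every number is an invented assumption, not a quoted price:

```python
# Back-of-envelope break-even: dedicated GPU box vs pay-per-token API.
# Every number below is an illustrative assumption, not a quoted price.
gpu_monthly_usd = 1200.0       # assumed rental for one inference GPU
api_usd_per_mtok = 3.0         # assumed blended API cost per 1M tokens

breakeven_mtok = gpu_monthly_usd / api_usd_per_mtok
print(f"break-even at roughly {breakeven_mtok:.0f}M tokens/month")
```

Under those assumptions you need sustained volume in the hundreds of millions of tokens per month before the GPU pays for itself, and that is before counting the ops time to keep it running.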
That isn't true; if you run local models you'll also need to spend on operations.
Maybe focus first on providing value and later you can optimize this setup.
It feels like the traditional fixed SaaS pricing model is slowly shifting toward more consumption-based pricing.
That’s true, but AI is interesting because consumption-based pricing introduces a lot more variance than typical SaaS infrastructure. One user action can trigger dozens of model calls in an agent workflow. That’s partly why we started experimenting with models like https://oxlo.ai where the pricing flips back to a fixed subscription and we absorb the usage spikes.
Honestly, if you're designing your agent workflows properly with hard limits on retries and tool calls, the variance shouldn't be that wild. Most of the unpredictability comes from not having those guardrails in place early on. A few weeks of real production data usually shows the average cost is more stable than you'd expect.
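A minimal sketch of what those hard limits look like in practice. `call_llm` is a stand-in for any client, and the validation logic is deliberately trivial:

```python
# Sketch of a hard guardrail: a per-action retry cap so one request
# can't loop unbounded. call_llm is a stand-in for any LLM client.

MAX_RETRIES = 2

def run_with_retries(call_llm, task, max_retries=MAX_RETRIES):
    attempts = max_retries + 1
    for _ in range(attempts):
        output = call_llm(task)
        if output.get("valid"):
            return output
        # retry only inside the hard cap; never an open-ended loop
    raise RuntimeError(f"invalid output after {attempts} attempts")

# fake client that fails validation once, then succeeds
calls = {"n": 0}
def fake_llm(task):
    calls["n"] += 1
    return {"valid": calls["n"] > 1, "text": "ok"}

result = run_with_retries(fake_llm, "recommend products")
print(result["text"], calls["n"])  # ok 2
```

The same pattern extends to tool calls and total tokens: a counter with a ceiling, checked before every spend.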
True, but for early stage builders it’s harder to design those guardrails upfront. A lot of the time you only discover the retry patterns and cost spikes once real users start hitting the system.
Fair point. And honestly, with more non-technical builders shipping agent-based products these days, that's probably where a service like this makes the most sense – for people who don't yet have the experience to know what guardrails to put in place.
Exactly. That’s actually why we started building Oxlo.ai. Early stage builders usually just want to experiment without worrying too much about token cost spikes.
imo switching to local models could be an option
Local models solve the marginal cost problem, but they move the complexity into infrastructure and throughput planning instead.
makes sense, it really depends on the use case. I'm building my own version of claw, openwalrus, with local LLMs as the first goal. I think I'll use local models for daily tasks that heavily depend on tool calling, but for coding or research I'll keep using remote models
and this topic actually gave me an idea: I could introduce a built-in gas meter for tokens
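a rough sketch of what a token gas meter could look like (names and numbers here are illustrative, not part of any real project):

```python
# rough sketch of a token "gas meter": every model call draws from a
# hard budget and execution halts before overspending.
# names and numbers are illustrative.

class OutOfGas(Exception):
    pass

class GasMeter:
    def __init__(self, limit_tokens: int):
        self.limit = limit_tokens
        self.used = 0

    def charge(self, tokens: int) -> None:
        # refuse the call entirely rather than blow through the budget
        if self.used + tokens > self.limit:
            raise OutOfGas(f"would use {self.used + tokens}, limit {self.limit}")
        self.used += tokens

meter = GasMeter(limit_tokens=5000)
meter.charge(1800)  # first agent step
meter.charge(2100)  # second step
try:
    meter.charge(2000)  # 3900 + 2000 > 5000, stop here
except OutOfGas:
    print(f"halted at {meter.used} tokens")  # halted at 3900 tokens
```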
Just add very hard upper limits, plus instrumentation so you can track usage and re-evaluate the limits accordingly.
This takes a couple of hours at most.
Sounds like a plan. But what if you could just pay a fixed cost every month and not worry about any of it?