It’s a bit of a chicken & egg thing and depends a ton on how the LLM is applied within an app. I always start at the core design of the integration and focus hard on the problem it solves. Why are you using an LLM in the first place? What functions does it need to perform in the context of the user interaction? These are the kinds of questions that help you understand the constraints you need to implement.
So for example: a project I’m working on is a diagramming tool, and I’m implementing an AI layer on top of it so users can refine/edit/generate diagrams. The tool creates maps structured as a JSON schema, but these can get really long, sometimes thousands of lines depending on the complexity of the diagram.
Obviously feeding the AI an entire diagram, or having it generate one from scratch, is expensive here, so the fix was building a deterministic translation layer that compresses the diagram into a compact semantic model for the LLM: stripping visual noise (x/y coordinates), deduplicating relationships, resolving references, etc.
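To make that concrete, here’s a minimal sketch of what such a translation layer can look like. The field names (`nodes`, `edges`, `from`, `to`, `kind`) are illustrative assumptions, not the tool’s actual schema:

```python
def compress_diagram(diagram: dict) -> dict:
    """Deterministically shrink a diagram before it goes to the LLM."""
    # Resolve node references so edges can use human-readable labels
    labels = {n["id"]: n.get("label", n["id"]) for n in diagram["nodes"]}

    # Drop x/y coordinates and other layout-only fields (visual noise)
    nodes = [
        {"id": n["id"], "label": labels[n["id"]], "type": n.get("type")}
        for n in diagram["nodes"]
    ]

    # Deduplicate relationships: same source/target/kind listed twice
    seen, edges = set(), []
    for e in diagram["edges"]:
        key = (e["from"], e["to"], e.get("kind"))
        if key not in seen:
            seen.add(key)
            edges.append({"from": labels[e["from"]],
                          "to": labels[e["to"]],
                          "kind": e.get("kind")})

    return {"nodes": nodes, "edges": edges}
```

Because the layer is deterministic, it costs nothing per request and the LLM only ever sees the semantic skeleton of the diagram.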
With this, while keeping the interaction intact, we cut token usage by ~75% across the app. On the output side, the LLM only produces the changes needed, not the full diagram. Layout, validation, and rendering are computed client-side for free, so costs only scale with what the user asks for.
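The output side can be sketched the same way: the LLM emits a small list of change operations, and the client applies them to the full diagram locally. The operation names (`add_node`, `remove_node`, `rename`) are assumptions for illustration:

```python
def apply_changes(diagram: dict, changes: list) -> dict:
    """Apply LLM-emitted change ops client-side; no full regeneration."""
    nodes = {n["id"]: n for n in diagram["nodes"]}
    for ch in changes:
        if ch["op"] == "add_node":
            nodes[ch["node"]["id"]] = ch["node"]
        elif ch["op"] == "remove_node":
            nodes.pop(ch["id"], None)
        elif ch["op"] == "rename":
            nodes[ch["id"]]["label"] = ch["label"]
    # Layout/validation/rendering happen after this, entirely client-side
    return {"nodes": list(nodes.values()), "edges": diagram["edges"]}
```

Output tokens are then proportional to the size of the edit, not the size of the diagram.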
Good UX helps too: we pay attention to what users ask for and create “quick actions” that use the LLM within closed-loop subsystems. Since we use a credit system for AI tool usage, we can assign credit costs to quick actions accurately, because each action has a defined scope.
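Because each quick action has a bounded scope, pricing it becomes a lookup rather than a guess. A rough sketch, with entirely made-up action names and numbers:

```python
# Each quick action has a known worst-case scope, so its credit cost
# can be fixed up front. All names and values here are hypothetical.
QUICK_ACTIONS = {
    "rename_node":      {"max_tokens": 200,  "credits": 1},
    "summarize_branch": {"max_tokens": 1500, "credits": 3},
    "generate_subtree": {"max_tokens": 4000, "credits": 8},
}

def credit_cost(action: str) -> int:
    """Look up the fixed credit price of a scoped quick action."""
    if action not in QUICK_ACTIONS:
        raise ValueError(f"unknown quick action: {action}")
    return QUICK_ACTIONS[action]["credits"]
```

Open-ended chat can’t be priced this way, which is exactly why narrowing the LLM’s job into scoped actions makes the economics predictable.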
TLDR: make the LLM do less, then put hard limits around the smaller set of things it’s allowed to do
Certainly. I use LiteLLM to get more cache hits and save money.