1. Cost management
LLMs can be expensive to host and operate.
2. LLM lifecycle: Cost management
While tracking costs is part of our monitoring strategy, we also need to think about how to lower these costs.
3. Cost management
We'll discuss managing LLM application costs, focusing on the primary cost driver: the model. Costs can escalate significantly based on hosting and usage. For self-hosted models, costs arise from hosting, while for externally hosted models, costs come from usage.
4. Breaking down LLM costs
Costs for cloud hosting depend on how long the server remains operational. For on-premise hosting, costs consist of hardware purchases plus operating expenses like maintenance and electricity.
Externally hosted proprietary models have usage-based costs, determined by the number of calls and the number of tokens per call.
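To make usage-based pricing concrete, here is a minimal sketch of how a per-call cost adds up from input and output tokens. The per-1K-token prices are hypothetical placeholders, not any provider's real rates:

```python
# Hypothetical per-token prices (USD per 1,000 tokens) -- assumptions, not real rates.
PRICE_PER_1K_INPUT_TOKENS = 0.0005
PRICE_PER_1K_OUTPUT_TOKENS = 0.0015

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single LLM call in USD, priced per token."""
    return (input_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
            + output_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS)

# 1,000 calls, each with 800 input tokens and 200 output tokens:
total = 1000 * call_cost(800, 200)
```

Even at fractions of a cent per call, totals grow quickly with call volume, which is why the strategies below focus on shrinking prompts and reducing calls.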
Although there are tools to compare model costs, it's harder to compare self-hosted and externally hosted options because their cost structures differ. Generally, there are three cost optimization strategies: choosing the right model, optimizing prompts, and optimizing the number of calls.
5. Strategy 1: Choose the right model
The initial cost optimization strategy involves choosing the appropriate model, as discussed previously. Instead of thinking about the highest quality model, think about the most cost-effective model that still accomplishes the task.
Additionally, consider using multiple smaller task-specific models instead of one large model.
For self-hosted models, model-size reduction techniques allow efficient operation on less expensive hardware with minimal loss of performance.
6. Strategy 2: Optimize prompts
The second cost optimization strategy focuses on optimizing prompts, making them shorter while still conveying necessary information. Prompt compression tools facilitate this task by automatically replacing and eliminating redundant wording.
Content reduction involves a deliberate effort to eliminate unnecessary text. For instance, chat applications often inject all past conversations into the prompt (known as chat memory); but much of this history can be excluded. Another deliberate approach is to optimize our RAG pipeline to return fewer results.
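One way to trim chat memory is to keep only the most recent exchanges. The sketch below assumes messages are dicts with a `role` key, as is common in chat APIs; the function name and turn limit are illustrative:

```python
def trim_history(messages: list[dict], max_turns: int = 3) -> list[dict]:
    """Keep the system prompt plus only the last `max_turns` user/assistant exchanges.

    Older turns are dropped before the prompt is sent, reducing input tokens.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # Each turn is one user message plus one assistant reply.
    return system + rest[-2 * max_turns:]
```

Dropping older turns like this shortens every subsequent prompt, which directly cuts input-token costs for long-running conversations.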
7. Strategy 3: Optimize the number of calls
The third cost optimization strategy involves reducing the number of calls, which can be achieved by consolidating multiple prompts into a single call, a practice known as batching.
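A simple form of batching is to merge several short prompts into one numbered request and ask the model to answer each in turn. The prompt format here is an assumption, not a provider API:

```python
def batch_prompts(prompts: list[str]) -> str:
    """Combine several short prompts into a single call to save per-call overhead."""
    numbered = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(prompts))
    return "Answer each of the following questions, numbering your answers:\n" + numbered
```

One call carrying several questions replaces several calls, trading a slightly longer prompt for fewer round trips and less repeated instruction text.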
In settings where repetitive questions are common, we can cache responses to reduce LLM usage, also speeding up response times.
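A minimal cache keyed on a normalized prompt can serve repeated questions without calling the model again. The class and the normalization (strip and lowercase) are illustrative choices:

```python
import hashlib

class ResponseCache:
    """Cache LLM responses so repeated prompts don't trigger new (paid) calls."""

    def __init__(self):
        self._store = {}

    def _key(self, prompt: str) -> str:
        # Normalize lightly so trivially different prompts hit the same entry.
        return hashlib.sha256(prompt.strip().lower().encode()).hexdigest()

    def get_or_call(self, prompt, llm_call):
        key = self._key(prompt)
        if key not in self._store:
            self._store[key] = llm_call(prompt)  # we only pay on a cache miss
        return self._store[key]

# Usage with a stub standing in for a real LLM call:
calls = {"n": 0}

def fake_llm(prompt):
    calls["n"] += 1
    return f"answer to: {prompt}"

cache = ResponseCache()
cache.get_or_call("What is RAG?", fake_llm)
cache.get_or_call("what is rag?  ", fake_llm)  # normalizes to the same key: no new call
```

In production you would add an eviction policy and expiry, but the principle holds: every cache hit is one fewer billed call and a faster response.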
Agents typically make multiple LLM calls per task, so it's advisable to optimize these calls and restrict how many an agent can make.
We can set quotas and rate limits on LLM calls, though this could cause the application to stop working when the limit is reached.
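A rolling-window rate limiter is one way to enforce such a limit without crashing the application; the caller checks `allow()` and falls back or queues when it returns False. The class below is a sketch with hypothetical names:

```python
import time

class RateLimiter:
    """Allow at most `max_calls` LLM calls per rolling `window` seconds."""

    def __init__(self, max_calls: int, window: float = 60.0):
        self.max_calls = max_calls
        self.window = window
        self._timestamps = []

    def allow(self) -> bool:
        now = time.monotonic()
        # Drop timestamps that have aged out of the window.
        self._timestamps = [t for t in self._timestamps if now - t < self.window]
        if len(self._timestamps) >= self.max_calls:
            return False  # caller should degrade gracefully, not stop working
        self._timestamps.append(now)
        return True
```

Returning False instead of raising lets the application respond with a fallback message rather than simply stopping when the limit is reached.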
Finally, always consider whether a task, such as summarization or text extraction, actually requires an LLM at all.
8. Cost metrics and prognosis
So what cost metrics should we track to make a prognosis? For self-hosted models, you'll want to monitor the cost per machine per time unit, while for externally hosted setups, it's crucial to track the cost per session. A session can contain multiple LLM calls, so it is often a better abstraction than individual calls.
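Tracking cost per session can be as simple as summing token costs under a session ID. The per-1K-token prices below are hypothetical defaults:

```python
from collections import defaultdict

# Running cost per session, in USD.
session_costs: dict[str, float] = defaultdict(float)

def record_call(session_id: str, input_tokens: int, output_tokens: int,
                price_in: float = 0.0005, price_out: float = 0.0015) -> None:
    """Attribute one call's token cost (prices are USD per 1K tokens, assumed) to its session."""
    session_costs[session_id] += (input_tokens * price_in
                                  + output_tokens * price_out) / 1000

record_call("session-1", input_tokens=800, output_tokens=200)
record_call("session-1", input_tokens=400, output_tokens=100)
```

Aggregating at the session level makes the metric robust to changes in how many calls a single user interaction happens to trigger.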
Understanding how your user base will grow and how costs will scale alongside this growth is key. For externally hosted solutions, cost typically scales linearly with users, whereas self-hosted setups often scale per machine, loosely tied to user numbers.
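For an externally hosted setup, the linear scaling mentioned above makes a prognosis straightforward to sketch; the function and all input values here are illustrative:

```python
def projected_monthly_cost(avg_cost_per_session: float,
                           sessions_per_user_per_month: float,
                           n_users: int) -> float:
    """Externally hosted: usage-based cost scales linearly with the user base."""
    return avg_cost_per_session * sessions_per_user_per_month * n_users

# E.g. $0.02 per session, 10 sessions per user per month, 5,000 users:
forecast = projected_monthly_cost(0.02, 10, 5000)
```

For self-hosted setups, the equivalent prognosis would instead be stepwise: cost jumps each time user growth forces you to add a machine.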
9. Let's practice!
Now that we understand how to manage the cost of our LLM application, it's time to put this knowledge into practice.