Optimizing AI for speed, cost and quality
1. Optimizing AI for speed, cost and quality
Welcome back! Let's explore more fine-grained options for controlling and budgeting AI model usage.
2. Metrics
There are three main metrics that are crucial to monitor when building an AI-powered workflow. The first is latency, which is the time it takes for a model to generate a response. Optimizing latency often involves trade-offs: smaller models are faster, but they may produce lower-quality results.
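As a rough sketch, latency can be tracked by simply timing the call itself. The generate() function and model name below are placeholders standing in for a real API call, not an actual library.

```python
import time

def generate(prompt, model):
    # Placeholder standing in for a real chat-completion API call.
    time.sleep(0.4)  # simulate network and generation time
    return "def reverse(s): return s[::-1]"

start = time.perf_counter()
response = generate("Write a Python function that reverses a string.", model="small-model")
latency = time.perf_counter() - start
print(f"Latency: {latency:.2f} s")
```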
3. Metrics
The second is token cost. This reflects how expensive an AI model is to run. As we saw in the previous video, token cost is calculated from both the input tokens and the output tokens. Optimizing token cost means trimming unnecessary context and setting token limits in our API calls.
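For illustration, here is a minimal cost estimate. The per-token prices are made up, so substitute your provider's actual pricing and token counts.

```python
# Hypothetical prices in USD per million tokens; real prices vary by model.
PRICE_PER_MILLION = {"input": 3.00, "output": 15.00}

def estimate_cost(input_tokens, output_tokens):
    # Total cost = input tokens * input price + output tokens * output price
    return (input_tokens * PRICE_PER_MILLION["input"]
            + output_tokens * PRICE_PER_MILLION["output"]) / 1_000_000

# A 2,000-token prompt that yields a 500-token answer:
print(f"${estimate_cost(2_000, 500):.4f}")  # $0.0135
```

Trimming the prompt shrinks the first term, while an output token limit in the API call caps the second.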
4. Metrics
The last one is output quality. There are different ways to define output quality. In code generation tasks, for example, one common metric is how often the generated code works as intended. In practice, it's very difficult to optimize all three metrics at the same time. Faster models may be cheaper, but they often sacrifice quality. Higher quality usually comes at a higher cost and with increased latency.
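One simple way to make "works as intended" measurable is a pass rate over a set of generated solutions. The candidates and test below are toy examples, not part of any benchmark.

```python
def pass_rate(candidates, check):
    # Fraction of generated solutions that pass the provided test.
    passed = 0
    for code in candidates:
        namespace = {}
        try:
            exec(code, namespace)   # run the generated code
            check(namespace)        # raises AssertionError if the behavior is wrong
            passed += 1
        except Exception:
            pass
    return passed / len(candidates)

# Two hypothetical model outputs for "reverse a string":
candidates = [
    "def reverse(s): return s[::-1]",
    "def reverse(s): return s",  # incorrect
]

def check(ns):
    assert ns["reverse"]("abc") == "cba"

print(pass_rate(candidates, check))  # 0.5
```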
5. Model benchmarking
We have already seen in previous videos how model benchmarking is done in code generation research. There are several well-established benchmarks designed to evaluate language models on code generation, often balancing different properties of code. Benchmarks like HumanEval and MultiPL-E focus on code generation and functional correctness, but not on real-world engineering integration. They mostly include tasks from the competitive programming domain, rather than challenges found in actual software development projects.
6. Model benchmarking
BigCodeBench, in contrast, is used to evaluate models on more realistic code generation tasks, beyond algorithmic problems.
7. Model benchmarking
SWE-bench also reflects real bug-fixing scenarios, using issues and pull requests from GitHub, the most widely used platform in software engineering.
8. Model benchmarking
And finally, benchmarks like COFFE focus on efficient code generation, measuring not just correctness but also runtime performance and memory usage.
9. Prompt versioning
Benchmarks are helpful for comparing models, but they don't help optimize model use. Let's take a look at some other techniques for that. First of all, we have prompt versioning. As we have already seen, coming up with the best prompt is an iterative process that involves a lot of trial and error. That is why maintaining and improving prompts over time is essential for both reliability and reproducibility.
10. Prompt versioning
A recommended practice here is to use prompt templates with version tags. This includes storing prompts in a version-controlled system. It makes it easier to track changes, ensure consistency, and roll back if a newer version causes degraded performance.
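A minimal sketch of that idea: keep prompts in a small registry with explicit version tags, stored in the same Git repository as the code. The prompt names and texts here are just examples.

```python
# Prompts stored with explicit version tags; this file lives under version control.
PROMPTS = {
    "code_review": {
        "v1": "Review the following code and list any bugs.",
        "v2": "Review the following code. List bugs, style issues, and suggested fixes.",
    },
}

def get_prompt(name, version):
    return PROMPTS[name][version]

# Rolling back is just a matter of requesting an earlier tag:
print(get_prompt("code_review", version="v1"))
```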
11. Prompt versioning
Another helpful strategy is to use variables or placeholders within our prompts. This avoids repeating similar text and helps standardize prompt structure across use cases.
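For example, a single template with placeholders can be filled in per request; the template text below is purely illustrative.

```python
# One reusable template; the curly-brace placeholders are filled per request.
TEMPLATE = (
    "You are a senior {language} developer.\n"
    "Task: {task}\n"
    "Code:\n{code}"
)

prompt = TEMPLATE.format(
    language="Python",
    task="Explain what this function does.",
    code="def reverse(s): return s[::-1]",
)
print(prompt)
```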
12. Prompt caching
Another strategy is prompt caching, which helps reduce redundant calls to the model. Caching works by storing the model's response, keyed by the combination of prompt, input, model, and temperature. That way, if we get a new input that is similar to one we've seen and cached before, we can reuse or adapt the previous result instead of making a fresh API call. Finally, when the context gets long, the input token cost can grow very quickly. One more strategy to keep in mind is combining multiple subtasks into a single prompt, especially when they share the same context.
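As a sketch, an exact-match cache can be keyed on that combination; reusing results for merely similar inputs would additionally require some form of similarity lookup. The generate_fn argument and fake_generate stand in for a real API call.

```python
import hashlib
import json

cache = {}  # in-memory cache; a production setup might use Redis or a database

def cached_generate(prompt, user_input, model, temperature, generate_fn):
    # Key the cache on the exact combination of prompt, input, model, and temperature.
    key = hashlib.sha256(
        json.dumps([prompt, user_input, model, temperature]).encode()
    ).hexdigest()
    if key not in cache:
        # Only pay for an API call on a cache miss.
        cache[key] = generate_fn(prompt, user_input, model, temperature)
    return cache[key]

fake_generate = lambda p, i, m, t: f"response to: {i}"  # stand-in for a real call
print(cached_generate("Summarize:", "some long text", "small-model", 0.0, fake_generate))
print(cached_generate("Summarize:", "some long text", "small-model", 0.0, fake_generate))  # cache hit
```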
13. Let's practice!
Let's practice with some exercises!