
Picking the best AI tool for the job

1. Picking the best AI tool for the job

Hello! In this video, we will learn how to evaluate AI models for code generation, so we can choose the best tool for the task.

2. Benchmarks

Benchmarks evaluate model performance on tasks like question answering or coding, with specialized tests for each area.

3. Benchmarks

Results are often displayed on leaderboards ranking models by accuracy and size, with higher accuracy being better. Since leaderboards are complex, a simpler approach is to test selected tasks across models and compare quality or cost.

4. Model comparison

Let’s see this in practice by comparing two models! Feel free to compare any of your favorite models here; for this illustration, we’re comparing a GPT model with a Gemini model. We’ll revisit an example from an earlier video, asking each model to write tests for a function.
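As a rough sketch of how such a side-by-side comparison could be scripted (assuming the openai and google-generativeai Python SDKs; the model names and prompt below are placeholders, not the exact ones from the video):

```python
# A minimal sketch: send the same prompt to two providers and compare outputs.
# Assumes API keys are configured; model names are placeholders.
from openai import OpenAI
import google.generativeai as genai

prompt = "Write pytest tests for the following function, mocking the database connection: ..."

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
gpt_response = openai_client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)

genai.configure(api_key="...")  # your Google API key
gemini_model = genai.GenerativeModel("gemini-1.5-pro")  # placeholder model name
gemini_response = gemini_model.generate_content(prompt)

print(gpt_response.choices[0].message.content)
print(gemini_response.text)
```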

5. Model comparison

Here, we see both GPT’s response and Gemini’s. But how do we compare them? Let’s break it down.

6. Tokenization

LLMs work with numerical data, so they can't process words directly. Instead, words are converted into numerical representations called tokens. The idea behind tokens is this: a single word can be broken down into one or more parts, depending on the language and the model. Model providers like OpenAI typically charge based on the number of tokens, both in the prompt and response. As an example, let’s calculate the cost of our prompt input. Technically, each model has its own tokenizer, meaning the same prompt might result in a different token count depending on the model. For simplicity, we’ll use a standard tokenizer here.
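As an illustration, counting tokens with the open-source tiktoken library might look like this (this is a general-purpose encoding, not necessarily the tokenizer either model actually uses):

```python
# A rough sketch of counting prompt tokens with tiktoken; each model's real
# tokenizer may split the same text into a different number of tokens.
import tiktoken

prompt = "Write pytest tests for the following function, mocking the database connection: ..."

encoding = tiktoken.get_encoding("cl100k_base")  # a common general-purpose encoding
tokens = encoding.encode(prompt)
print(len(tokens))  # approximate number of tokens the provider would bill for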

7. Cost

Suppose GPT charges $2 per million input tokens and Gemini $1.25. Our input prompt has 146 tokens, so the input costs about 0.03 cents with GPT and about 0.02 cents with Gemini.
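Here is that arithmetic spelled out, using the example rates above:

```python
# Input cost = tokens / 1,000,000 * price per million input tokens
input_tokens = 146

gpt_input_cost = input_tokens / 1_000_000 * 2.00     # ≈ $0.000292, i.e. ~0.03 cents
gemini_input_cost = input_tokens / 1_000_000 * 1.25  # ≈ $0.000183, i.e. ~0.02 cents
```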

8. Cost

Now let’s check the outputs. GPT produced 681 tokens, Gemini 2,079. With rates of $8 per million output tokens for GPT and $10 for Gemini, this comes to about 0.5 cents for GPT and about 2 cents for Gemini. Gemini’s verbosity, combined with its higher output rate, makes it more expensive overall.
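The same calculation for the outputs:

```python
# Output cost = tokens / 1,000,000 * price per million output tokens
gpt_output_cost = 681 / 1_000_000 * 8.00        # ≈ $0.0054, i.e. ~0.5 cents
gemini_output_cost = 2_079 / 1_000_000 * 10.00  # ≈ $0.0208, i.e. ~2 cents
```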

9. Reasoning models

If we use a reasoning model, costs rise further since reasoning traces also consume tokens. Traces can slow responses and interfere with systems expecting only final outputs. Still, they improve transparency, accuracy, and debugging. There’s always a trade-off.
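To make that cost effect concrete, here is a hypothetical illustration: the 1,500 reasoning tokens below are made up for the example, and we assume, as with some providers, that reasoning tokens are billed at the output rate.

```python
# Hypothetical: reasoning tokens billed at the output rate on top of the answer.
final_answer_tokens = 681
reasoning_tokens = 1_500  # made-up figure for illustration

cost_without_reasoning = final_answer_tokens / 1_000_000 * 8.00                     # ≈ $0.0054
cost_with_reasoning = (final_answer_tokens + reasoning_tokens) / 1_000_000 * 8.00   # ≈ $0.0174
```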

10. Output quality

Now let’s look at the quality of the output. For this test, we're simply following the steps provided by each model. When running the code from GPT, we encounter an error indicating that there’s no database to connect to. This isn’t a coding error; it’s due to a missing setup step that the model expects us to handle.

11. Output quality

In our prompt, we asked the model to mock the database connection. While GPT provides an implementation that assumes a database connection, it doesn’t deliver a complete end-to-end solution.
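For context, here is a hedged sketch of what “mocking the database connection” could look like in pytest; the function and helper names are hypothetical stand-ins, not the exact code from the video.

```python
# A minimal sketch of mocking a database connection in a pytest test.
# `get_user_count` and its `connect` argument are hypothetical stand-ins
# for the function under test.
from unittest.mock import MagicMock

def get_user_count(connect):
    """Example function under test: counts rows using a DB connection."""
    conn = connect()
    cursor = conn.cursor()
    cursor.execute("SELECT COUNT(*) FROM users")
    return cursor.fetchone()[0]

def test_get_user_count_with_mocked_connection():
    # Build a fake connection whose cursor returns a canned row,
    # so no real database is needed to run the test.
    fake_cursor = MagicMock()
    fake_cursor.fetchone.return_value = (42,)
    fake_connection = MagicMock()
    fake_connection.cursor.return_value = fake_cursor
    fake_connect = MagicMock(return_value=fake_connection)

    assert get_user_count(fake_connect) == 42
```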

12. Output quality

Finally, when testing the solution from Gemini, we see a complete end-to-end implementation using pytest, as requested. Although it’s more verbose (and therefore more expensive), Gemini’s solution appears to be more complete in this case.

13. Let's practice!

Let's practice how to pick the best models for our use cases!
