Using the Cortex COMPLETE function

1. Using the Cortex COMPLETE function

Hello again. In this video, we will use Cortex complete function to build a multi-turn chat application. Let's start. First though, sign in to your Snowflake account if you're not already signed in. Navigate to projects on the left panel, select notebooks, click on the notebook titled using LLM functions and select the start button at the top to start the notebook session. It takes a couple of seconds. Click on packages at the top right and search for Snowflake package to install. It takes a couple of seconds as well. We also need another package. Look for a Snowflake ML Python and install that as well. First, we import the active Snowpark session and then set the context for the rest of the notebook cells to use the SkiGear support DB and SkiGear support schema. Look at the code below using complete. Here we are using Python to call the complete function. As we saw before, complete can be used with different models. In this case, we're using the LLAMA 3.1 405 billion parameter model for its multilingual capability. If your application needs to interact with a wide range of languages, say French, German, Spanish, Italian, Portuguese, Arabic, Hindi, Russian, Chinese, Japanese, or Korean, you could use Mistral Large 2 or the LLAMA 3.1 405 billion model. If you're building a summarization, extraction, or question-and-answer use case involving lengthy documents or an extremely large knowledge base, it is useful to turn to models with larger context windows, such as Claude 3.5 Sonnet, which offers a context window size of 200,000 tokens, or Jamba Instruct, which boasts an even larger window of 256,000 tokens. For use cases requiring low latency or high throughput that only need low to moderate reasoning, small models below 10 billion parameter size can be the ideal choice, such as the LLAMA 3.2 1 billion and 3 billion parameter models. Not all models are available in every region, but for this course, we are working in AWS US West 2 Oregon region, which supports the widest set of models. Back to the arguments used by COMPLETE. Model is a single string that represents the model COMPLETE will use. The prompt or history can be either a column of prompts or a single string that are sent to the LLM. Here we are asking LLAMA how snowflakes get their unique patterns. Another thing to note is COMPLETE is stateless. It does not retain any state from one call to the next. In order to reference all the previous model responses and prompts, we must pass all previous inputs as part of the prompt or history to provide a stateful conversational experience. These must be passed in chronological order for proper reference. This allows for more complex multi-turn conversations that retain memory of previous stages of the conversation. Please note that with each iteration, more tokens are consumed and this can accumulate over larger conversations. It turns out that talk is not, in fact, cheap. Now let's look at the system and user roles in COMPLETE. When we call Cortex COMPLETE, we must ensure that each role key has an associated content key. The two roles we look at here are the system and user. The system prompt is an initial plain English prompt to the language model to provide it with background information and instructions for a response style. For example, here we instruct the model to answer programming questions in the style of a rancher. If we provide a system prompt, there can only be one and it must be the first message in the array of messages. Then the user prompt contains the meat of the request for which we expect a response from the model. For example, here we ask the model about the role of semicolons in JavaScript. You will notice that we've also added additional parameters to our call to COMPLETE using options. For now, we will start with just setting guardrails to true, which activates Cortex GUARD. Once you activate Cortex GUARD, language model responses associated with harmful content such as violent crimes, hate, sexual content, self-harm, and others will be automatically filtered out and the model will return a response filtered by Cortex GUARD message. Under the hood, Cortex GUARD, currently powered by LlamaGuard2 from Meta, works by evaluating the responses of a language model before that output is returned to the application. You can see how Cortex GUARD evaluates and filters out responses. Now that we are utilizing options, we need to know two key differences. First, when we utilize options, the prompt argument must be an array of objects representing a conversation in chronological order, where each object contains the role and content key. We will cover more about how to do this later. Second, when options are present, we will get a full response object from COMPLETE. It now returns the JSON object from the LLM that contains the response to the prompt, information about the model used, and the number of tokens used at each step. Next, let's take a look at how COMPLETE can be used for multi-turn conversations in chat. We mentioned before that although COMPLETE is stateless, we can choose to pass in the conversation history, which will enable multi-turn conversations. Let's look at the messages with history list. First, we start off with a system prompt. Then, we define a user prompt that is just the question the user asked the model. Next comes the model's response captured under the assistant role. This way, we could keep appending our question to the model and its response in messages with history variable and recursively call the COMPLETE function to build a multi-turn chat application. For this multi-turn chat experience, it would be really great if the user could read the output text similarly to how they would in a real conversation, word by word, or in LLM speak, token by token. Another word for this is streaming. We can enable streaming with a keyword argument stream by setting it equal to true. When enabled, a generator function is returned that provides the streaming output as it is received. Each update is a string containing the new text content since the previous update. Doing so dramatically reduces the perceived latency to the user because each token, as soon as it is generated, appears on the output screen rather than waiting for the entire response to be generated. This can make the user experience seem more interactive and responsive when it is utilized in the user-facing applications that our models support. Please note that streaming is enabled for Python only, not SQL. The reason for this is that SQL, as mentioned before, is meant and optimized for batch processing scenarios where latency of an individual response is not a concern. When we want a chatbot to feel more creative and interesting, we can turn to my favorite parameters, temperature and top p. Temperature is a value from 0 to 1 that we pass to the model to control the randomness of the output that is returned to us. More technically, a low temperature makes the model more confident in its top choices. Let's try it by adding a high temperature to our same chat with the Western Rancher about JavaScript. Oh, the response is a bit more fun. As an alternative to using temperature, we can also use top p. Top p achieves a similar result but in a different way. Top p works by restricting the generated tokens to tokens that fit in a cumulative distribution below the threshold you set. For example, if top p is equal to 0.3, only tokens comprising the 30% probability mass would be considered. Moving from chat experiences to more task-based actions, it can be especially useful to constrain the number of tokens an LLM can generate. If we want to limit the number of output tokens so we do not get an entire book as an output, we can use max tokens which sets a limit on the maximum number of tokens the model can generate in its response. This can be very useful for us where we need to constrain the output but we should be aware that small token values may result in truncated responses. An example of this would be where you want the LLM to only respond with either a number from 0 to 5 or only the word yes or no. In this case, we can limit the response by setting the max tokens to 1 so that a longer response would not be returned. In many cases, when we are asking the LLM to perform constrained tasks like this, we will want to do so in batch. The easiest way to use LLMs in a batch is through SQL. In this example, we use complete to critique an entire column of transcripts in a snowflake table called call transcripts. Now, let's talk about the output of complete by clicking on the first row in our output to see it closely. As we said before, when we specify options, complete will return a JSON string which will contain the following keys, choices, created, model, and usage. The choices key in options returns an array of the model's responses. Each of these are in an object that contains a key. Messages holds the response that the model generated based on the last prompt passed to it. Currently, this only holds one response. We also see a Unix timestamp of when the response was generated and the model that generated it. It also contains three subkeys under usage, completion tokens, prompt tokens, and total tokens used for this completion. We saw earlier that Cortex complete supports a number of foundation models such as Llama, Gemma, Mistral, Reka, and Anthropic family of models. However, what is the right model to use for your use case and requirement? To first learn the art of the possible, we often want to start experimentation with the best large models. This helps us understand what our capability constraint is. Then, to optimize against both our latency and compute budget, we want to scale down and find the smallest model that can successfully do what we need to do. Smaller models can be really useful for simpler, more constrained tasks that don't require a large amount of context. Only when we need to perform more complex tasks or handle a large amount of context do we need to turn to state-of-the-art models with hundreds of billions of parameters. If we choose a model that has too small of a context window, then any inputs that exceeds the token number will result in a context overflow. If we really are bumping up against a tight budget of time and cost, but need to complete a more complex task, stay tuned. Phew, that was a little long. Let's review what we covered. We looked at Cortex complete, which generates a response given a prompt and choice of LLM. We looked at how we can use a series of options to guide the output of this model and how to pass prompts to the model as well. We learned how complete is stateless by default and how we can build stateful applications that reference previous model responses by using system, user, and assistant roles. We talked about what to consider when choosing models for complete to use. We looked at options such as max tokens, temperature, and top P. Lastly, we looked at cost considerations when developing and testing in the Cortex environment and all the options available for use. In the next video, we will get some hands-on experience using the task-specific functions that Cortex offers.

2. Let's practice!

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.