Preparing your Training Data

1. Preparing your Training Data

Welcome back. In this video, you will start preparing the training data needed to fine-tune a foundation model. We learned from the previous experiment that Mistral-Large was better at generating custom responses than Mistral-7b. This is great, but large models are more costly to run than smaller ones. An ideal solution would be to improve the accuracy of the small model through fine-tuning. This is what we're going to do next. We will fine-tune Mistral-7b to optimize for cost while, at the same time, improving the accuracy of its responses.
If you're not logged into your Snowflake account, please pause the video now and log in. Navigate to the Projects tab in the left panel and select Notebooks. Click on the Fine-tuning Mistral-7b notebook and select the Start button at the top right to initiate the notebook session. Let's quickly run through the cells up to the Mistral-Large response section.
As we saw in the previous experiment, Mistral-Large was better at generating custom responses for tickets than Mistral-7b, but neither model generated ideal responses for our support tickets. We want to fine-tune the smaller Mistral-7b model to generate the desired responses for our customers. To fine-tune the model we need data, meaning a set of prompts and their ideal responses for the support tickets, which will serve as our fine-tuning data set.
For this purpose, we can filter the correct responses that Mistral-Large generated and use them to fine-tune the smaller model, Mistral-7b. That is, for customers who have text message as their contact preference, if the model's response was under 25 words, let's use that data for the fine-tuning job. Similarly, for customers who have email as their contact preference, if the model's response was more than 30 words, let's use that data as well. In the next cell, we save this data set into the Support Ticket Responses table in Snowflake. Let's check that it worked by running a Select All on the Support Ticket Responses table.
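To make the filtering rule concrete, here is a minimal sketch in plain Python rather than the notebook's own code. The 25-word and 30-word thresholds come from the video. The row layout, the preference labels "Text Message" and "Email", and the function name are assumptions made only for illustration.

```python
# Word-count thresholds taken from the video; everything else is illustrative.
MAX_WORDS_TEXT_MESSAGE = 25  # text-message replies must be under 25 words
MIN_WORDS_EMAIL = 30         # email replies must be more than 30 words


def is_desired_response(contact_preference: str, response: str) -> bool:
    """Return True if the response length suits the contact preference."""
    n_words = len(response.split())
    if contact_preference == "Text Message":  # assumed label
        return n_words < MAX_WORDS_TEXT_MESSAGE
    if contact_preference == "Email":  # assumed label
        return n_words > MIN_WORDS_EMAIL
    return False  # unknown preference: leave the row out


# Hypothetical rows standing in for the Mistral-Large output.
rows = [
    {"ticket_id": 1, "contact_preference": "Text Message", "response": " ".join(["word"] * 12)},
    {"ticket_id": 2, "contact_preference": "Text Message", "response": " ".join(["word"] * 40)},
    {"ticket_id": 3, "contact_preference": "Email", "response": " ".join(["word"] * 45)},
    {"ticket_id": 4, "contact_preference": "Email", "response": " ".join(["word"] * 20)},
]

# Keep only the responses whose length matches the customer's preference.
fine_tuning_rows = [
    r for r in rows if is_desired_response(r["contact_preference"], r["response"])
]
print([r["ticket_id"] for r in fine_tuning_rows])  # [1, 3]
```

Tickets 2 and 4 are dropped because a 40-word text message is too long and a 20-word email is too short, so only responses that already match the desired style end up in the fine-tuning data.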
Now, let us format this data set into three columns: Ticket ID, Prompt, and Mistral-Large Response. This data is now ready for the fine-tune job on Mistral-7b. Let's save it into the Support Tickets Fine-Tune Message Style table. All good!
Next, we will split this table into a training and an evaluation data set. In the next cell, we use the random split function to split the data: 80% of the data will be used for training and 20% for evaluation. We set the seed equal to 42 to make sure that each time we run this code, the split will be the same. You could set the seed to any number, as long as you keep it the same on every run. I like to keep it at 42 thanks to The Hitchhiker's Guide to the Galaxy. All right! Our data is ready for training.
Before we get started on that, let's look at some things we need to keep in mind before we run our fine-tune job. Cortex Fine-Tune requires that the training data be in the form of a table or a view within the Snowflake environment. We need to make sure that we have both the training and validation data ready and in a table or view that we can call. Even tens of data points can be sufficient to start. If you look at our fine-tuning data set, it has only about 60 rows. That's it.
This is sufficient because, as I mentioned before, Cortex uses parameter-efficient fine-tuning under the hood. The parameter-efficient fine-tuning technique improves the performance of a pre-trained model on a specific task by fine-tuning only a small subset of its parameters. It is a good choice when you need to adapt a model to a domain-specific task while keeping costs and resources low. Perfect for our scenario here. Parameter-efficient fine-tuning starts by freezing the model's parameters and injecting a small number of adapter layers in between the frozen layers. Since we are not retraining the entire model, we get to leverage the model's existing knowledge as a foundation to layer the fine-tuning onto.
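Before moving on, here is the split step from earlier in this video sketched in plain Python. This is not the notebook's code: the helper `split_train_eval`, the row layout, and the placeholder text are hypothetical, and the shuffle-then-cut approach is a simple stand-in for whatever the notebook's random split function does internally. It only illustrates the two ideas from the video, an 80/20 split and a fixed seed of 42 so that the split is the same on every run.

```python
import random


def split_train_eval(rows, train_fraction=0.8, seed=42):
    """Shuffle rows with a fixed seed, then cut them into train and eval sets."""
    rng = random.Random(seed)  # private generator, so the split is repeatable
    shuffled = list(rows)      # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical stand-in for the roughly 60-row fine-tuning table.
rows = [
    {
        "ticket_id": i,
        "prompt": f"prompt for ticket {i}",
        "mistral_large_response": f"response for ticket {i}",
    }
    for i in range(60)
]

train_rows, eval_rows = split_train_eval(rows)
print(len(train_rows), len(eval_rows))  # 48 12

# Same seed, same split: running it again gives identical training rows.
train_again, _ = split_train_eval(rows)
print(train_rows == train_again)  # True
```

Changing the seed would produce a different but equally valid split. What matters is keeping it fixed, so that training and evaluation results stay comparable between runs.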
The model already has general knowledge under its belt. Now we want to train it to generate responses in a style similar to our training data, that is, based on our customers' contact preferences.
We have set up our environment and prepared our training data. Well done! Let's look at what we covered in this video. We looked at Cortex Fine-Tune and how we must prepare both the training and validation data. We learned that both must have prompt and completion columns for the function to work. We looked at how this data must be in a Snowflake table or view. We randomly split our data into training and evaluation data sets and have them ready to go. We also examined how the fine-tune function in Cortex is implemented under the hood, and how it can fine-tune a model without having to retrain all the parameters of the foundation model. We also looked at an example of calling this function in a Snowflake notebook. Next, we will be looking at how to start the fine-tune process. See you soon!

2. Let's practice!
