Parsing and Chunking Text

1. Parsing and Chunking Text

Hello again. Imagine for a second that I'm back in my office and I've just finished uploading the council meeting minutes to the stage. First, I need to parse the text from the raw PDFs. Once I have the text in hand, I can go about breaking up the data into searchable chunks. This is a useful technique in RAG settings, where we want to feed the LLM only the information it needs, without extra fluff that could confuse the model and incur unnecessary costs. Shall we start?

In this video, you'll learn how to parse and split your documents. After that, you'll do this in your own notebook in the next part of the hands-on practice. Within the Snowflake environment, we can use the PARSE_DOCUMENT task-specific function, which returns the extracted content from documents on a stage. I've left a link to the documentation in the reading following this video.

To follow along, go back to where you were in the notebook in the last video. If you left your notebook, you may need to get your active session again. We left off the last video with raw PDFs in the stage that need to be parsed so that Cortex Search can query them. Let's do that now.

For this, you'll create a new table called ParsedFOMCContent to store the parsed text using a query. The first column we'll add is the relative path, and for the second we'll use the PARSE_DOCUMENT SQL function. PARSE_DOCUMENT takes the location of our stage, the relative file path, and the extraction mode we want to use. The mode can be set to optical character recognition, also known as OCR, or to layout. The default mode is OCR, and it's relatively inexpensive. Layout mode is optimized for extracting text together with layout elements like tables. It's also really great for RAG because it considers the semantic structure of the documents, so we'll use it here. We'll run the cell and then select the top two rows to make sure it worked. Great.
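If it helps to see the parsing query written out, here is a minimal sketch of what it might look like, built as a string in Python so it could be passed to your active session. The stage name `fomc_stage`, the table name `parsed_fomc_content`, the column alias `parsed_text`, and the helper `build_parse_query` are all assumptions made up for illustration; swap in whatever names your own notebook uses, and check the PARSE_DOCUMENT documentation for the exact options available to you.

```python
def build_parse_query(stage: str, target_table: str, mode: str = "LAYOUT") -> str:
    """Return a CREATE TABLE ... AS SELECT statement that parses every PDF on a stage.

    `stage` and `target_table` are hypothetical names supplied by the caller.
    """
    mode = mode.upper()
    if mode not in {"OCR", "LAYOUT"}:
        raise ValueError("mode must be 'OCR' or 'LAYOUT'")
    # PARSE_DOCUMENT takes the stage, the relative file path, and an options
    # object holding the extraction mode. Its result is an object, so the
    # sketch pulls out the `content` field and casts it to text.
    return f"""
CREATE OR REPLACE TABLE {target_table} AS
SELECT
    relative_path,
    TO_VARCHAR(
        SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
            @{stage},
            relative_path,
            {{'mode': '{mode}'}}
        ):content
    ) AS parsed_text
FROM DIRECTORY(@{stage})
WHERE relative_path LIKE '%.pdf'
""".strip()


# Assumed names, for illustration only.
query = build_parse_query("fomc_stage", "parsed_fomc_content")
print(query)

# In a Snowflake notebook you would then run something like:
#   session.sql(query).collect()
# followed by a quick check of the first two rows:
#   session.sql("SELECT * FROM parsed_fomc_content LIMIT 2").show()
```

Building the statement as a string keeps the mode easy to switch between OCR and layout while you experiment. Note that selecting from `DIRECTORY(@stage)` assumes the stage has a directory table enabled.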
Everything is ready to move on. Now that we have our documents in a readable format, we need to split, or chunk, the data. The simplest way to chunk our data is by counting characters, or we could even chunk by counting tokens. But to level up, we can consider smarter ways to split text that make sense for a RAG setting. One of these is SPLIT_TEXT_RECURSIVE_CHARACTER. This is useful because it splits the text into recursively shorter strings, and with it we can choose whether to separate on Markdown elements, along with other separators like new lines. If you want to learn more, I've left a link to the documentation for text splitting in the reading following this video.

Here, you create the chunked FOMC docs table and insert into it the file name and chunk from parsed FOMC content, where we specify the markdown format, a chunk size of 1800, and an overlap of 250.

Now, how did we choose that chunk size? Chunk size is set by a parameter that we pass to our chunking function as an integer. Depending on the data, larger or smaller chunk sizes may be optimal, and there's no one-size-fits-all answer or easy rule of thumb. Experimentation is great here. Once we've decided on the size of the chunks, we must also think about the overlap. We set the overlap as an integer so that each chunk carries some context from the previous chunk. All of this we pass to the chunker. The function splits on text separators first, and then continues to split each chunk recursively until all chunks are below the chunk size limit we specified.

Before we finish, we should check that this worked by running a select all on chunked FOMC docs. Looks good. Now that you have your data uploaded, parsed, and chunked, you can create the search service. We'll set your Cortex Search service to look at the chunks we just created. Nice work so far, and I'll see you in the next video.
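To make the chunk size and overlap parameters a bit more concrete, here is a small, simplified sketch of recursive character splitting in plain Python. This is not Snowflake's implementation: the function names, the separator list, and the way overlap is carried over are all assumptions chosen for illustration, and it leaves out the Markdown-aware separators mentioned above. It only shows the general idea of splitting on coarse separators first, falling back to finer ones, and then packing the pieces into chunks that share some trailing context.

```python
def split_recursive(text: str, limit: int, separators: list[str]) -> list[str]:
    """Break text into pieces of at most `limit` characters, coarse separators first."""
    if len(text) <= limit:
        return [text] if text else []
    if not separators:
        # No separators left to try: fall back to a hard cut every `limit` characters.
        return [text[i:i + limit] for i in range(0, len(text), limit)]
    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    # Re-attach the separator so no characters are lost when pieces are merged.
    parts = [p + sep for p in parts[:-1]] + [parts[-1]]
    pieces: list[str] = []
    for part in parts:
        pieces.extend(split_recursive(part, limit, finer))
    return pieces


def merge_with_overlap(pieces: list[str], chunk_size: int, overlap: int) -> list[str]:
    """Greedily pack pieces into chunks, starting each chunk with the previous chunk's tail."""
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        if current and len(current) + len(piece) > chunk_size:
            chunks.append(current)
            # Carry the last `overlap` characters forward as context.
            current = current[-overlap:] if overlap else ""
        current += piece
    if current:
        chunks.append(current)
    return chunks


def chunk_text(text: str, chunk_size: int = 1800, overlap: int = 250,
               separators: tuple[str, ...] = ("\n\n", "\n", " ")) -> list[str]:
    """Split text into overlapping chunks of at most `chunk_size` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    # Pieces are capped at chunk_size - overlap so that a carried-over tail
    # plus one new piece always fits inside chunk_size.
    pieces = split_recursive(text, chunk_size - overlap, list(separators))
    return merge_with_overlap(pieces, chunk_size, overlap)


# Tiny demo with deliberately small numbers so the overlap is easy to see.
sample = "alpha beta gamma delta epsilon zeta eta theta"
for chunk in chunk_text(sample, chunk_size=20, overlap=5):
    print(len(chunk), repr(chunk))
```

Running the demo with different `chunk_size` and `overlap` values is a quick way to get a feel for the trade-off: larger chunks keep more context together, while more overlap repeats more text across neighboring chunks.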

2. Let's practice!
