Automatic Processing of New Documents

1. Automatic Processing of New Documents

Welcome back. In the last video, we diagnosed a problem: the RAG is using stale data. So how do we fix this? When we first talked about Cortex Search, you might remember that it can automatically handle refresh and re-indexing as fast as one minute, or whatever you set the target lag parameter to be. This means that the Cortex Search service itself is as fresh as our data. But what happens when our data gets stale? Well, we need to update it, and we need some way to automate the process from raw PDFs to parsed and chunked data indexed by Cortex Search, so we're not doing all of this manually.

And that's where tasks and streams come in. We can create a stream that captures changes to the stage, so anytime data is added, the stream tells us exactly what has changed. Then, when the stream has new data, we can use a task to run transformations on it, like parsing and chunking.

To make this happen, the first stream we create will watch for changes in our stage. I'll call this FOMC doc stream. Then I'll create a task to transform the PDFs in the stage into parsed text. I'll use the same code we wrote before, and it'll run on a schedule of once a minute whenever new data is available in the stream. Next, I'll create a second stream to check when new parsed data is available. We'll call this FOMC parsed stream. And I'll create a second task to chunk this data. Again, same code, just now it's contained in a task that watches our parsed stream. Once I've created the tasks, I need to resume them, because tasks start out suspended when they are created.

And now I can upload new PDFs to my stage and let the magic happen. Let's try it. Here, I'll upload a new set of PDFs to my stage. Now I'll go over to watch the tasks in action. I can do that by clicking the Database Explorer and then into Tasks. Then I click on the parsing task and look at the run history.
And I'll wait just a few seconds for it to show up here. Once it's done, I can go check the chunking task too. And when that's done, I just need to wait for the Cortex Search service to re-index the new data. Now let's go back to the notebook to try it. We get a useful answer. The RAG is now fresh.

In this video, we learned how to use tasks and streams to keep our Cortex Search service and RAG fresh. This mirrors the journey we would take from prototype to production for any LLM app. We started with some experiments, we iterated to improve our app, and we made sure our app has a fresh data source so it can remain high quality when it's in the hands of users. Now that the backend is in good shape, let's add the frontend. On to the next one.
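
For anyone following along without the video, here is a rough sketch of what the two streams and two tasks described above might look like. It is written as a small Python helper that only builds the SQL text, so it can be read and run without a Snowflake connection. The transcript does not show the actual code, so every object name here (stage, tables, tasks, warehouse), the column names, and the chunk size and overlap are assumptions, and the parsing and chunking queries are stand-ins for "the same code we wrote before". It also assumes the stage already has a directory table enabled, which a stream on a stage relies on.

```python
from textwrap import dedent

# All identifiers below are hypothetical: the transcript names the two streams
# only in speech and does not show stage, table, task, or warehouse names.
STAGE = "fomc_stage"
WAREHOUSE = "compute_wh"
DOC_STREAM = "fomc_doc_stream"
PARSED_STREAM = "fomc_parsed_stream"
PARSED_TABLE = "fomc_parsed"
CHUNKS_TABLE = "fomc_chunks"
PARSE_TASK = "fomc_parse_task"
CHUNK_TASK = "fomc_chunk_task"


def pipeline_statements() -> list[str]:
    """Build the SQL for the stream -> task -> stream -> task pipeline, in order."""
    raw = [
        # Stream 1 records files added to the stage (via its directory table).
        f"CREATE OR REPLACE STREAM {DOC_STREAM} ON STAGE {STAGE}",
        # Task 1 is scheduled every minute but runs only when stream 1 has rows.
        # The SELECT is an assumed stand-in for the course's parsing code.
        f"""
        CREATE OR REPLACE TASK {PARSE_TASK}
          WAREHOUSE = {WAREHOUSE}
          SCHEDULE = '1 MINUTE'
          WHEN SYSTEM$STREAM_HAS_DATA('{DOC_STREAM}')
        AS
          INSERT INTO {PARSED_TABLE} (relative_path, parsed_text)
          SELECT relative_path,
                 TO_VARCHAR(SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
                   '@{STAGE}', relative_path, {{'mode': 'LAYOUT'}}):content)
          FROM {DOC_STREAM}
          WHERE METADATA$ACTION = 'INSERT'
        """,
        # Stream 2 records rows added to the parsed-text table.
        f"CREATE OR REPLACE STREAM {PARSED_STREAM} ON TABLE {PARSED_TABLE}",
        # Task 2 chunks newly parsed text; 1800 and 250 are illustrative values.
        f"""
        CREATE OR REPLACE TASK {CHUNK_TASK}
          WAREHOUSE = {WAREHOUSE}
          SCHEDULE = '1 MINUTE'
          WHEN SYSTEM$STREAM_HAS_DATA('{PARSED_STREAM}')
        AS
          INSERT INTO {CHUNKS_TABLE} (relative_path, chunk)
          SELECT s.relative_path, c.value::STRING
          FROM {PARSED_STREAM} s,
               LATERAL FLATTEN(input => SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER(
                 s.parsed_text, 'markdown', 1800, 250)) c
          WHERE s.METADATA$ACTION = 'INSERT'
        """,
        # Tasks are created suspended, so each one has to be resumed.
        f"ALTER TASK {PARSE_TASK} RESUME",
        f"ALTER TASK {CHUNK_TASK} RESUME",
    ]
    return [dedent(stmt).strip() for stmt in raw]


def task_history_query(task_name: str) -> str:
    """SQL counterpart of the run-history view: recent runs of one task."""
    return dedent(f"""
        SELECT name, state, scheduled_time, completed_time, error_message
        FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(TASK_NAME => '{task_name}'))
        ORDER BY scheduled_time DESC
    """).strip()


if __name__ == "__main__":
    for statement in pipeline_statements():
        print(statement + ";\n")
    print(task_history_query(PARSE_TASK) + ";")
```

Each statement could be pasted into a worksheet, or run from a notebook with something like `session.sql(statement).collect()` on a Snowpark session. The `WHEN SYSTEM$STREAM_HAS_DATA(...)` condition is what gives the "once a minute, whenever new data is available" behavior: the schedule fires every minute, but the task body is skipped while the stream is empty.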

2. Let's practice!
