Get startedGet started for free

RAG with text and images

1. RAG with text and images

Congrats on making it to the final chapter of the course - we've left the best until last!

2. What about the images?

In the PDF documents we're processed so far, we've essentially ignored the image content. Docling converts any images into an image placeholder, which can be useful, but all of the context captured in the image has been lost. In this video and the coming exercises, we'll treat each PDF page as an image, rather than breaking them into the sub-components of text and images.

3. RAG with images

The pipeline for performing RAG with images is virtually the same as a text-only RAG system. The only difference is that a multi-modal embedding model is required to embed both the text query and the image documents.

4. Multi-modal embedding models

This model, in essence, is able to embed text and images into shared representations, so an embedded image of a cat and an embedded version of the story "The Cat in the Hat" should appear next to one another in the vector-space.

5. Encoding images

We'll be using a ColPali model to do this. ColPali breaks images up into patches and turns each patch into a vector using a vision–language model, or VLM.

6. Encoding images

A text query is also broken into smaller pieces called tokens, and each token is turned into a vector in the same space. To find the best match,

7. Encoding images

the model compares every text token with the image patches, takes the best matches, and adds them up. The image whose patches align best with the query tokens is retrieved.

8. Augmenting and generating with images

To use this retrieved context for generation, we need a multi-modal generative model that accepts both text and image inputs. We'll be using a GPT model from OpenAI for this. This multi-modal generative model requires a prompt that integrates both the user input text and retrieved document image.

9. Let's practice!

Now it's time for you to try this out yourself!

Create Your Free Account

or

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.