
Integrating document loaders

1. Integrating document loaders

In this chapter, we'll discuss retrieval augmented generation, or RAG.

2. Retrieval Augmented Generation (RAG)

Pre-trained language models don't have access to external data sources; their understanding comes purely from their training data. This means that if we need our model to have knowledge that goes beyond its training data, such as company data or more recent world events, we need a way of integrating that data. In RAG, a user query is embedded and used to retrieve the most relevant documents from a database. These documents are then added to the model's prompt so that the model has extra context to inform its response.
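To make the retrieve-then-prompt flow concrete, here is a minimal sketch that uses only the Python standard library. It is not LangChain code: the word-count "embedding", the helper names (embed, cosine, retrieve, build_prompt), and the sample documents are all assumptions made for illustration. A real system would use an embedding model and a proper document store.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding (assumption): lowercase word counts instead of a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(documents, key=lambda doc: cosine(query_vector, embed(doc)), reverse=True)
    return ranked[:k]


def build_prompt(query, documents, k=1):
    """Add the retrieved documents to the prompt as extra context."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical "company data" that a pre-trained model would not know about.
documents = [
    "The office is closed on public holidays.",
    "Expense reports are due on the first Monday of each month.",
    "The cafeteria serves lunch from noon until two.",
]

prompt = build_prompt("When are expense reports due?", documents)
print(prompt)
```

The prompt that reaches the model now contains the one document that shares the most words with the question, which is the essence of RAG.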

3. RAG development steps

There are three primary steps to RAG development in LangChain. The first is loading the documents into LangChain with document loaders. The next is splitting the documents into chunks. Chunks are units of information that we can index and process individually. The last step is encoding and storing the chunks for retrieval, which could utilize a vector database if that meets the needs of the use case. We'll discuss all of these steps throughout the next chapter, but for now, we'll start with document loaders.
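To give a feel for the splitting step, here is a hand-rolled sketch of fixed-size chunking with a small overlap between neighbouring chunks. This is not one of LangChain's text splitters; the function name split_into_chunks and its chunk_size and chunk_overlap parameters are assumptions chosen for illustration.

```python
def split_into_chunks(text, chunk_size=40, chunk_overlap=10):
    """Split text into fixed-size character chunks that overlap slightly.

    The overlap means a sentence cut at a chunk boundary still appears
    partly in the next chunk.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the last chunk is never just a repeat of the overlap.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


# Hypothetical document text.
sample_text = (
    "RAG adds external documents to a model's prompt. "
    "Documents are loaded, split into chunks, and stored for retrieval."
)

chunks = split_into_chunks(sample_text)
for index, chunk in enumerate(chunks):
    print(index, repr(chunk))
```

Each chunk can then be encoded and stored individually, which is the third step described above.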

4. LangChain document loaders

LangChain document loaders are classes designed to load and configure documents for integration with AI systems. LangChain provides document loader classes for common file types such as CSVs and PDFs. There are also additional loaders provided by third parties for managing unique document formats, including Amazon S3 files, Jupyter notebooks, audio transcripts, and many more. In this video, we will practice loading data from three common formats: PDFs, CSVs, and HTML. LangChain has excellent documentation on all of its document loaders, and there's a lot of overlap in syntax, so explore at your leisure!

5. PDF document loader

There are a few different types of PDF loaders in LangChain, and there is documentation available online for each. In this video, we'll use the PyPDFLoader. We instantiate the PyPDFLoader class, passing in the path to the PDF file we're loading. Then, we use the .load() method to load the document into memory, and assign the resulting list of documents to the data variable. We can then check the output to confirm that we have loaded it. Note that this document loader requires installation of the pypdf package as a dependency.
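The steps just described might look like the following sketch. It assumes the langchain-community and pypdf packages are installed and that the loader is imported from langchain_community.document_loaders (the import path has moved between LangChain versions, so check your installed version). The load_pdf helper and the file name example.pdf are hypothetical.

```python
from pathlib import Path


def load_pdf(path):
    """Load a PDF into a list of LangChain documents."""
    # Imported here so the dependency is only needed when a PDF is actually loaded.
    # Assumption: langchain-community and pypdf are installed.
    from langchain_community.document_loaders import PyPDFLoader

    loader = PyPDFLoader(path)  # instantiate with the path to the PDF
    return loader.load()        # load the document into memory


pdf_path = "example.pdf"  # hypothetical file name

if Path(pdf_path).exists():
    data = load_pdf(pdf_path)
    print(data[0])  # check the first loaded document
```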

6. CSV document loader

When loading CSVs, the syntax is very similar, but instead we use the CSVLoader class. We're seeing a pattern forming!
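As a sketch of that pattern, the same steps with CSVLoader could look like this. The assumptions match the PDF case: langchain-community is installed and the import path fits your LangChain version. The load_csv helper and the file name example.csv are hypothetical.

```python
from pathlib import Path


def load_csv(path):
    """Load a CSV file into a list of LangChain documents."""
    # Assumption: langchain-community is installed.
    from langchain_community.document_loaders import CSVLoader

    loader = CSVLoader(file_path=path)  # same pattern: instantiate with the path
    return loader.load()                # then call .load()


csv_path = "example.csv"  # hypothetical file name

if Path(csv_path).exists():
    data = load_csv(csv_path)
    print(data[0])  # check the first loaded document
```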

7. HTML document loader

Finally, we can load HTML files using the UnstructuredHTMLLoader class. As with the other loaders, .load() returns a list of documents, so we can access the first document with subsetting, view its contents with the page_content attribute, and extract its metadata with the metadata attribute.
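A sketch of loading an HTML file and inspecting the result follows. It assumes langchain-community and the unstructured package are installed, and that the import path fits your LangChain version. The load_html helper and the file name example.html are hypothetical.

```python
from pathlib import Path


def load_html(path):
    """Load an HTML file into a list of LangChain documents."""
    # Assumption: langchain-community and unstructured are installed.
    from langchain_community.document_loaders import UnstructuredHTMLLoader

    loader = UnstructuredHTMLLoader(path)
    return loader.load()


html_path = "example.html"  # hypothetical file name

if Path(html_path).exists():
    data = load_html(html_path)
    first_document = data[0]            # subsetting: take the first document
    print(first_document.page_content)  # the document's text contents
    print(first_document.metadata)      # the document's metadata
```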

8. Let's practice!

Time to begin loading documents for RAG!