Loading and splitting code files

1. Loading and splitting code files

In this chapter, we'll dive deeper into building more sophisticated RAG architectures with LangChain, starting with methods for loading and splitting code files.

2. More document loaders...

We've previously looked at a few common file formats, including PDFs, CSVs, and HTML files. Let's extend this a little further to Python and Markdown files, which are both common in software engineering and data science projects.

3. Loading Markdown files (.md)

Let's start with Markdown files. Markdown is a lightweight markup language for creating formatted documents, and it's often the tool of choice for writing code documentation.

4. Loading Markdown files (.md)

When rendered, Markdown can show links, images, code blocks, and much more.

5. Loading Markdown files (.md)

The UnstructuredMarkdownLoader class can be used to load Markdown files the same way as the other file formats we've looked at: by instantiating the class with the file path and calling the .load() method to load the file into memory. We could integrate these documents into a RAG application that reads code documentation and makes recommendations.

6. Loading Python files (.py)

Now onto Python files. Imagine we have a codebase and would like a way to chat with it and ask questions about it. We could achieve this by integrating Python files into a RAG application. The PythonLoader class and its .load() method can be used to load these files into memory. The resulting documents have .page_content and .metadata attributes for accessing the document's details. Remember that parsing Python files can be tricky, because Python has its own syntax, with imports, classes, functions, and much more, and that structure needs to be preserved during chunking. Let's see how we can address that.
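To picture what a loader hands back, here is a rough sketch in plain Python. It is not LangChain's PythonLoader: the Document dataclass, the load_python_file function, and the "source" metadata key are hypothetical stand-ins chosen only to show the shape of a loaded document, with its text in page_content and its details in metadata.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Document:
    """Hypothetical stand-in for a loaded document (not LangChain's class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_python_file(path):
    """Read one .py file and return it as a single-item list of Documents."""
    text = Path(path).read_text(encoding="utf-8")
    # Assumption: the file path is recorded under a "source" key.
    return [Document(page_content=text, metadata={"source": str(path)})]


# Write a tiny script to a temporary directory so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "greet.py"
    script.write_text(
        "def greet(name):\n    return f'Hello, {name}!'\n", encoding="utf-8"
    )
    docs = load_python_file(script)

print(docs[0].page_content)  # the raw source code, untouched
print(docs[0].metadata)      # where the document came from
```

The whole file arrives as one document, which is why a splitting step is needed before the code can be embedded and retrieved.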

7. Splitting code files

Let's start by splitting this Python file with our current best tool for document splitting: RecursiveCharacterTextSplitter. We set chunk_size and chunk_overlap to control the number of characters in each chunk and the number of characters shared between neighboring chunks. After splitting the documents with the .split_documents() method, we can print the content of each chunk.
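To make the two parameters concrete, here is a deliberately simplified stand-in written in plain Python. It cuts fixed-width character windows and is not how RecursiveCharacterTextSplitter works internally (the real splitter tries a list of separators first). The names split_fixed and SAMPLE are hypothetical, and the sample file is made up for illustration.

```python
SAMPLE = '''import os

class Loader:
    def load(self, path):
        return open(path).read()

class Splitter:
    def split(self, text):
        return text.split()
'''


def split_fixed(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighboring chunks share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and < chunk_size")
    step = chunk_size - chunk_overlap
    # Stop once the remaining text is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


chunks = split_fixed(SAMPLE, chunk_size=60, chunk_overlap=10)

for number, chunk in enumerate(chunks, start=1):
    print(f"--- chunk {number} ---")
    print(chunk)
```

Because the cut positions depend only on character counts, the first chunk ends partway through the Loader class. This is the same kind of lost context that structure-unaware splitting produces on real code.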

8. Splitting code files

Here's the result. Notice that the split between chunks two and three splits the Anthropic class, and because chunks are processed separately, key context has been lost. Our current strategy is naive because it doesn't consider structures like classes and functions. Let's change this!

9. Splitting by language

Let's split our loaded Python file using RecursiveCharacterTextSplitter again, but this time, let's use the .from_language() method. This method has a language argument, which specifies the programming language and which we can set to Language.PYTHON; the rest of the arguments stay the same. This changes the default separators list from the hierarchy of paragraphs, sentences, and words to one that tries splitting on class and function definitions first, before moving on to the standard separators. Now let's call .split_documents() to perform the split, and view the first three chunks again.
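The idea of a separator hierarchy can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: split_recursive, PYTHON_SEPARATORS, and SAMPLE are hypothetical names, the separator list is an assumed, shortened one, and the sketch does not merge small pieces or add overlap as the real splitter does.

```python
SAMPLE = '''import os

class Loader:
    def load(self, path):
        return open(path).read()

class Splitter:
    def split(self, text):
        return text.split()
'''

# Assumed priority order: class definitions, then function definitions,
# then the standard paragraph, line, and word separators.
PYTHON_SEPARATORS = ["\nclass ", "\ndef ", "\n\n", "\n", " "]


def split_recursive(text, separators, chunk_size):
    """Split on the highest-priority separator, recursing into big pieces."""
    # Small enough already, or nothing left to split on: keep as one chunk.
    if len(text) <= chunk_size or not separators:
        return [text]
    separator, remaining = separators[0], separators[1:]
    if separator not in text:
        return split_recursive(text, remaining, chunk_size)
    parts = text.split(separator)
    # Re-attach the separator so keywords like "class" stay in the chunk.
    pieces = [parts[0]] + [separator + part for part in parts[1:]]
    chunks = []
    for piece in pieces:
        if piece:
            chunks.extend(split_recursive(piece, remaining, chunk_size))
    return chunks


chunks = split_recursive(SAMPLE, PYTHON_SEPARATORS, chunk_size=100)

for number, chunk in enumerate(chunks, start=1):
    print(f"--- chunk {number} ---")
    print(chunk.strip())
```

With this sample and chunk_size, each class fits inside the limit, so the cuts fall only at class boundaries. A class longer than chunk_size would be passed down to the lower-priority separators and split further.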

10. Splitting by language

As we can see, the splitter was able to split on class definitions, so each class's context is kept together. Note that this approach isn't foolproof: depending on the size of the classes and functions relative to the chunk_size, we may get differing results, and a class longer than chunk_size will still be split across chunks.

11. Let's practice!

Now let's see what we can do!