Lexical graphs

1. Lexical graphs

In the last chapter,

2. Recap

we built a text-to-Cypher chain to retrieve information from a knowledge graph and integrate it with an LLM. In this chapter, we'll optimize this approach by focusing on the knowledge graph itself, looking at how different graph models can improve Graph RAG retrieval. We'll also learn how to combine knowledge graphs and vector embeddings to get the best of both worlds!

3. Lexical graphs

In Graph RAG, the multi-level hierarchy that represents the logical structure of a document is referred to as a lexical graph. Documents can be naturally sub-divided into pages, and the raw text from each page can be split into chunks.

4. Romeo and Juliet

Throughout this chapter, we will explore Shakespeare's famous play, Romeo and Juliet. This text is split into sections including an overview and licensing information, character information, and a prologue followed by five acts each containing multiple scenes. Notice how each act and scene is clearly numbered with Roman numerals.

5. Romeo and Juliet as a knowledge graph

The play can be split into five Acts,

6. Romeo and Juliet as a knowledge graph

each act will have relationships to many scenes,

7. Romeo and Juliet as a knowledge graph

and each scene consists of many lines.

8. Romeo and Juliet as a knowledge graph

Later in the course, we will also create a relationship from each line to the Character who speaks it.

9. Romeo and Juliet as a knowledge graph

The lines have their own semantic meaning, and they are small enough that they can be used for semantic search.
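As a rough sketch of this hierarchy, the snippet below lists each level as a (source label, relationship type, target label) triple and renders the chain as a Cypher-style path pattern. Only HAS_ACT is named in this chapter; the HAS_SCENE and HAS_LINE types, and the Scene and Line labels, are assumed names for illustration.

```python
# Each triple reads: (source label, relationship type, target label).
LEXICAL_SCHEMA = [
    ("Play", "HAS_ACT", "Act"),      # HAS_ACT is the type used in this chapter
    ("Act", "HAS_SCENE", "Scene"),   # assumed name
    ("Scene", "HAS_LINE", "Line"),   # assumed name
]


def cypher_pattern(schema):
    """Render a chain of triples as one Cypher-style path pattern."""
    parts = [f"(:{schema[0][0]})"]
    for _source, rel_type, target in schema:
        parts.append(f"-[:{rel_type}]->(:{target})")
    return "".join(parts)


print(cypher_pattern(LEXICAL_SCHEMA))
```

Reading the pattern from left to right follows the document from the whole play down to a single line, which is the level we would search semantically.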

10. Loading the document

LangChain offers several document loaders for reading a document and loading the text into memory. We are using the PyPDFLoader and its .load() method to load the raw text from a PDF document.
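A minimal sketch of this loading step might look like the following. The file name in the usage comment is hypothetical, and joining the pages into one string with join_pages is our own assumption about how the text is prepared for splitting.

```python
def load_pdf_pages(pdf_path):
    """Load a PDF into a list of LangChain Document objects, one per page."""
    # Imported here so the helper below can be used without LangChain installed.
    from langchain_community.document_loaders import PyPDFLoader

    loader = PyPDFLoader(pdf_path)
    return loader.load()


def join_pages(pages):
    """Concatenate the raw text of every page into a single string."""
    return "\n".join(page.page_content for page in pages)


# Example usage (requires the PDF file and the pypdf package):
# pages = load_pdf_pages("romeo-and-juliet.pdf")  # hypothetical file name
# text = join_pages(pages)
```

Each loaded page keeps its text in page_content and details such as the page number in metadata, so the page level of the hierarchy is still available if we need it.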

11. Splitting the document

To create the hierarchy of the lexical graph, we start by using RecursiveCharacterTextSplitter to create the acts. We split the text on paragraphs that start with "The Prologue", "Act" or asterisks followed by the word "END". We write these patterns as regular expressions, or regex. Each pattern is a raw string, with an "r" before the opening quote, so that backslashes are passed to the regex engine unchanged, and we need to specify is_separator_regex=True so the splitter treats the separators as patterns rather than literal text. With our acts created, we need a second RecursiveCharacterTextSplitter to split the text for each act into scenes. Here, we use regex to specify that the split should be made on the word "Scene".
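The two splitters could be set up roughly as sketched below. The exact patterns, the upper-case headings, and the chunk_size value are assumptions, not the course's own settings. The preview_split helper is a stand-in built on Python's re module so we can see where the patterns match; how the real splitter merges small pieces depends on chunk_size, so its output may differ.

```python
import re

# Assumed patterns: a blank line followed by one of the headings.
# Upper-case headings are an assumption about the source text.
ACT_SEPARATORS = [r"\n\nTHE PROLOGUE", r"\n\nACT ", r"\n\n\*+ ?END"]
SCENE_SEPARATORS = [r"\n\nSCENE "]


def preview_split(text, separators):
    """Stdlib preview: split before each match so headings stay with their text."""
    pattern = "|".join(f"(?={sep})" for sep in separators)
    return [piece.strip() for piece in re.split(pattern, text) if piece.strip()]


def build_splitters(chunk_size=1000):
    """Create the act and scene splitters (chunk_size is an assumed value)."""
    # Imported here so the preview above runs without LangChain installed.
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    act_splitter = RecursiveCharacterTextSplitter(
        separators=ACT_SEPARATORS,
        chunk_size=chunk_size,
        chunk_overlap=0,
        is_separator_regex=True,
    )
    scene_splitter = RecursiveCharacterTextSplitter(
        separators=SCENE_SEPARATORS,
        chunk_size=chunk_size,
        chunk_overlap=0,
        is_separator_regex=True,
    )
    return act_splitter, scene_splitter


sample = (
    "Overview and licence\n\n"
    "THE PROLOGUE\nTwo households, both alike in dignity\n\n"
    "ACT I\n\nSCENE I. A public place.\n\n"
    "ACT II\n\nSCENE I. An open place.\n\n"
    "*** END ***"
)
pieces = preview_split(sample, ACT_SEPARATORS)
print([piece.splitlines()[0] for piece in pieces])
```

With the real splitters, we would call act_splitter.split_text(text) on the full play and then scene_splitter.split_text(act) on each act.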

12. Creating nodes and relationships

Each Act node should have a relationship to a node at the top of the hierarchy that represents the play. This play node has a type of Play, a unique ID of romeo-and-juliet, and a dictionary of properties that describe the play. We can then create a GraphDocument to store the nodes and relationships that we create as we iterate through the acts and scenes.
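One way to set up the play node and the GraphDocument is sketched below. The property keys and values are illustrative assumptions, and we pass the full text as the source Document because some LangChain versions require a source.

```python
PLAY_ID = "romeo-and-juliet"

# Illustrative properties; the exact keys are an assumption.
PLAY_PROPERTIES = {
    "title": "Romeo and Juliet",
    "author": "William Shakespeare",
}


def new_play_graph_document(full_text):
    """Create the top-level Play node and a GraphDocument that contains it."""
    # Imported here so the constants above can be used without LangChain.
    from langchain_core.documents import Document
    from langchain_community.graphs.graph_document import GraphDocument, Node

    play = Node(id=PLAY_ID, type="Play", properties=PLAY_PROPERTIES)
    graph_doc = GraphDocument(
        nodes=[play],
        relationships=[],
        source=Document(page_content=full_text),
    )
    return play, graph_doc
```

Keeping a reference to the play node is convenient, because every act relationship we create later uses it as the source.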

13. Extracting acts

We use the act_splitter to split the text into sections: the prologue, followed by Acts 1, 2, 3, and so on.

14. Extracting acts

We loop over the acts, and check if the first line of the text begins with the word ACT. If it does, we create a new node with the type Act, using the first line as the unique ID and append it to the GraphDocument.

15. Extracting acts

Then, we create a new relationship from the play node to the act node with a type of HAS_ACT. We use the index variable from the loop to set an order property, which records the position of each act in the play. This relationship is then appended to the list of relationships in the graph document.
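The loop over the acts could be sketched like this. The helper names act_heading and add_acts are hypothetical, and we assume the play node and graph document were created beforehand with LangChain's Node and GraphDocument classes.

```python
def act_heading(act_text):
    """Return the first line if the piece starts with 'ACT', else None."""
    stripped = act_text.strip()
    if not stripped:
        return None
    first_line = stripped.splitlines()[0].strip()
    return first_line if first_line.startswith("ACT") else None


def add_acts(graph_doc, play, acts):
    """Add an Act node and a HAS_ACT relationship for each act in the list."""
    # Imported here so act_heading can be used without LangChain installed.
    from langchain_community.graphs.graph_document import Node, Relationship

    for index, act_text in enumerate(acts):
        heading = act_heading(act_text)
        if heading is None:
            continue  # skip pieces such as the overview or the prologue

        # The first line, for example "ACT I", doubles as the unique ID.
        act_node = Node(id=heading, type="Act")
        graph_doc.nodes.append(act_node)

        # The loop index is stored so the acts can be read back in order.
        graph_doc.relationships.append(
            Relationship(
                source=play,
                target=act_node,
                type="HAS_ACT",
                properties={"order": index},
            )
        )
    return graph_doc
```

Because the index counts every piece returned by the splitter, including those we skip, the order values may not start at zero, but they still sort the acts correctly.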

16. Saving nodes and relationships

Finally, we use the .add_graph_documents() method to merge the nodes and relationships into the graph. In the next video, we'll add embeddings to the mix, but for now,
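A hedged sketch of the saving step is shown below. It assumes a Neo4j database is reachable and that the connection details live in the NEO4J_URI, NEO4J_USERNAME, and NEO4J_PASSWORD environment variables; the default URL and username are also assumptions.

```python
import os


def connection_settings(env):
    """Collect Neo4j connection details from a mapping such as os.environ."""
    return {
        "url": env.get("NEO4J_URI", "bolt://localhost:7687"),  # assumed default
        "username": env.get("NEO4J_USERNAME", "neo4j"),        # assumed default
        "password": env["NEO4J_PASSWORD"],
    }


def save_graph(graph_documents, env=os.environ):
    """Write the nodes and relationships of each graph document to Neo4j."""
    # Imported here so connection_settings works without the Neo4j packages.
    from langchain_neo4j import Neo4jGraph

    graph = Neo4jGraph(**connection_settings(env))
    graph.add_graph_documents(graph_documents)
    return graph


# Example usage, with graph_doc built in the earlier steps:
# graph = save_graph([graph_doc])
```

Note that .add_graph_documents() expects a list, so a single GraphDocument is wrapped in square brackets.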

17. Let's practice!

let's practice building out the hierarchical lexical graph!
