Splitting the play into Acts
Converting unstructured text into hierarchical lexical graphs is an iterative process that involves building splitters for each lexical entity and then splitting by each in turn.
In this exercise, you'll design a splitter to split the play, Romeo and Juliet, into acts. Here is a preview of the structure of the play:
The Project Gutenberg eBook of Romeo and Juliet
This ebook is for the use of anyone anywhere in the United States...
**PROLOGUE:**
Enter Chorus.
CHORUS.
Two households, both alike in dignity...
ACT I
SCENE I. A public place.
Enter Sampson and Gregory armed with swords and bucklers.
SAMPSON.
Gregory, on my word, we’ll not carry coals...
...
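The iterative idea described above, splitting by one lexical entity and then by the next, can be sketched with Python's standard `re` module alone. This is a minimal illustration rather than the course's LangChain solution: the `sample` text and the `split_keep_heading` helper are hypothetical stand-ins invented for this example.

```python
import re

# Hypothetical miniature of the play's layout (not the real eBook text).
sample = (
    "THE PROLOGUE.\n\nEnter Chorus.\n\n"
    "ACT I\n\nSCENE I. A public place.\n\nSAMPSON.\nGregory...\n\n"
    "SCENE II. A street.\n\nCAPULET.\nBut Montague...\n\n"
    "ACT II\n\nSCENE I. An open place.\n\nROMEO.\nCan I go forward...\n"
)


def split_keep_heading(text, pattern):
    """Split text just before each match of pattern, so headings stay with their chunk."""
    parts = re.split(f"(?={pattern})", text)
    return [part.strip() for part in parts if part.strip()]


# Level 1: split into acts. Level 2: split each act into scenes.
hierarchy = {}
for act in split_keep_heading(sample, r"\n\nACT "):
    act_title = act.split("\n")[0]
    scenes = split_keep_heading(act, r"\n\nSCENE ")
    # Keep only chunks that actually begin with a scene heading.
    hierarchy[act_title] = [s.split("\n")[0] for s in scenes if s.startswith("SCENE")]

for act_title, scene_titles in hierarchy.items():
    print(act_title, scene_titles)
```

Each level reuses the same helper with a different pattern, which is the sense in which the process is iterative: one splitter per lexical entity, applied in turn.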
This exercise is part of the course
Graph RAG with LangChain and Neo4j
Exercise instructions
- Update the `separators` argument to also split the text on the pattern `\n\nACT`.
- Configure the `act_splitter` to treat the `separators` list as regular expressions.
- Split `romeo_and_juliet` using `act_splitter`.
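Why it matters whether a separator is read as a regular expression or as a literal string can be seen with a small sketch. It uses only Python's `re` module and a made-up `text` value, so it illustrates the general distinction rather than LangChain's internals.

```python
import re

# Hypothetical snippet shaped like the end of the eBook.
text = "ACT V\n\nFinis.\n\n*** END OF THE PROJECT GUTENBERG EBOOK ***"

# Read as a regular expression, the asterisks must be escaped,
# because a bare "*" is a quantifier rather than a literal character.
regex_parts = re.split(r"\n\n\*\*\* END", text)

# Read as a literal string, the same separator (backslashes included)
# never occurs in the text, so nothing is split.
literal_parts = text.split(r"\n\n\*\*\* END")

print(len(regex_parts))    # 2
print(len(literal_parts))  # 1
```

The escaped pattern only behaves as intended when it is interpreted as a regular expression, which is why the splitter needs to be told how to read its separators.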
Hands-on interactive exercise
Try this exercise by completing this sample code.
act_splitter = RecursiveCharacterTextSplitter(
    separators=[
        r"\n\nTHE PROLOGUE.",
        r"\n\n\*\*\* END",
        # Split by the word ACT
        r"____"
    ],
    # Configure the patterns as regular expressions
    ____=True
)

# Split the play using act_splitter
acts = act_splitter.____(____)

for act in acts:
    print(act.strip().split("\n")[0])