Splitting HTML
In this exercise, you'll split an HTML document containing an executive order on AI issued by the US White House in October 2023. To retain as much context as possible in the chunks, you'll split using larger chunk_size and chunk_overlap values.
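To see how these two values interact, here is a minimal pure-Python sketch. It is a simplified illustration and not how LangChain's splitter works internally: it cuts a string into fixed-size windows, whereas RecursiveCharacterTextSplitter splits on separators first. The function name sliding_chunks and the sample string are hypothetical, chosen only for this example.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap

    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last four characters of the one before it, so text near a chunk boundary appears in both neighbours. A larger chunk_overlap keeps more shared context at the cost of more duplicated text.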
All of the LangChain classes necessary for completing this exercise have been pre-loaded for you.
This exercise is part of the course Developing LLM Applications with LangChain.
Exercise instructions
- Create a document loader for white_house_executive_order_nov_2023.html, and load it into memory.
- Set a chunk_size of 300 and a chunk_overlap of 100.
- Define the splitter, splitting on the '.' character, and use it to split data and print the chunks.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Load the HTML document into memory
loader = ____
data = ____
# Define variables
chunk_size = ____
chunk_overlap = ____
# Split the HTML
splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separators=____)
docs = ____
print(docs)
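One possible way to fill in the blanks is sketched below. It rests on assumptions the exercise text does not confirm: that the pre-loaded loader class is UnstructuredHTMLLoader from langchain_community (which also needs the unstructured package installed), and that the splitter is imported from langchain_text_splitters. The wrapper function split_html is a hypothetical helper, not part of the exercise.

```python
def split_html(path: str, chunk_size: int = 300, chunk_overlap: int = 100):
    """Load an HTML file and split it into overlapping chunks on the '.' character."""
    # Imported inside the function so the definition loads even where LangChain
    # is not installed. The loader class is an assumption; the exercise only
    # says the required classes are pre-loaded.
    from langchain_community.document_loaders import UnstructuredHTMLLoader
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # Load the HTML document into memory
    loader = UnstructuredHTMLLoader(path)
    data = loader.load()

    # Split the HTML; separators takes a list of strings
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["."],
    )
    return splitter.split_documents(data)


# Example call, assuming the file from the exercise is in the working directory:
# docs = split_html("white_house_executive_order_nov_2023.html")
# print(docs)
```

With '.' as the only separator, the splitter has no finer separator to fall back on, so a sentence longer than chunk_size may come back as a chunk that exceeds the limit.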