Splitting HTML
In this exercise, you'll split an HTML document containing an executive order on AI issued by the US White House in October 2023. To retain as much context as possible in the chunks, you'll split using larger chunk_size and chunk_overlap values.
All of the LangChain classes necessary for completing this exercise have been pre-loaded for you.
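To make the role of chunk_size and chunk_overlap concrete, here is a minimal pure-Python sketch of overlapping character windows. The helper name sliding_window_chunks is hypothetical and the logic is a deliberate simplification; it is not how LangChain's splitters work internally, since those also try to break on separators.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that overlap.

    Hypothetical helper for illustration only (not part of LangChain).
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(text, chunk_size=10, chunk_overlap=4)
print(chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk repeats the last four characters of the previous one, which is the kind of shared context a larger chunk_overlap is meant to preserve.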
This exercise is part of the course
Developing LLM Applications with LangChain
Exercise instructions
- Create an UnstructuredHTMLLoader for white_house_executive_order_nov_2023.html, and load it into memory.
- Set a chunk_size of 300 and a chunk_overlap of 100.
- Create a RecursiveCharacterTextSplitter splitting on the '.' character, and use the .split_documents() method to split data and print the chunks.
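As a rough intuition for what splitting on the '.' character means, the sketch below greedily packs period-terminated pieces into chunks no longer than chunk_size. The function split_on_periods and the sample text are assumptions made up for illustration; the sketch ignores overlap and is far simpler than RecursiveCharacterTextSplitter, whose exact output will differ.

```python
def split_on_periods(text: str, chunk_size: int) -> list[str]:
    """Greedily pack '.'-terminated pieces into chunks of at most chunk_size characters.

    Simplified illustration: no overlap, and a single piece longer than
    chunk_size is kept whole rather than split further.
    """
    # Split on '.', drop empty pieces, and re-attach the period to each piece.
    pieces = [p.strip() + "." for p in text.split(".") if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip()
        if current and len(candidate) > chunk_size:
            # Adding this piece would exceed the limit, so close the current chunk.
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "AI holds promise. AI holds peril. Responsible use matters. Governance is needed."
chunks = split_on_periods(sample, chunk_size=40)
for chunk in chunks:
    print(len(chunk), chunk)
```

With a small chunk_size, most sentences end up in their own chunk; raising it lets neighbouring sentences share a chunk, which is why the exercise uses a relatively large value.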
Hands-on interactive exercise
Try this exercise by completing this sample code.
# Load the HTML document into memory
loader = UnstructuredHTMLLoader(____)
data = loader.____()
# Define variables
chunk_size = ____
chunk_overlap = ____
# Split the HTML
splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separators=____)
docs = splitter.____(data)
print(docs)