Loading HTML files for RAG
It's possible to load documents from many different formats, including complex formats like HTML.
If you're not familiar with HTML, it's a markup language for creating web pages. Here's a small example:
<!DOCTYPE html>
<html>
<body>
<h2>Heading</h2>
<p>Here's some text and an image below:</p>
<img src="image.jpg" alt="..." width="104" height="142">
</body>
</html>
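To see why HTML counts as a complex format, it helps to look at what has to be stripped away before the text is usable. The following is a minimal sketch using only Python's standard-library `html.parser`; the `TextExtractor` class and `extract_text` function are hypothetical helpers written for illustration, and this is not how `UnstructuredHTMLLoader` is implemented internally.

```python
from html.parser import HTMLParser

# The small HTML example from above, stored as a string.
HTML = """<!DOCTYPE html>
<html>
<body>
<h2>Heading</h2>
<p>Here's some text and an image below:</p>
<img src="image.jpg" alt="..." width="104" height="142">
</body>
</html>"""


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect text content, ignoring tags and attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; skip whitespace-only runs.
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    """Return the text content of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


text = extract_text(HTML)
print(text)
```

Only the heading and paragraph text survive; the tags and the image attributes are dropped. A document loader does this kind of cleanup for you and wraps the result in document objects.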
In this exercise, you'll load an HTML file containing a DataCamp blog post webpage. The necessary classes have already been imported for you.
This exercise is part of the course
Retrieval Augmented Generation (RAG) with LangChain
Exercise instructions
- Use the UnstructuredHTMLLoader class to load the datacamp-blog.html file in the current directory.
- Load the documents into memory.
- Print the first document's page content.
- Print the first document's metadata.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create a document loader for unstructured HTML
loader = ____
# Load the document
data = ____
# Print the first document's content
print(____)
# Print the first document's metadata
print(____)
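One possible way to complete the template is sketched below. It assumes the `langchain-community` and `unstructured` packages are installed and that the loader is imported from `langchain_community.document_loaders`; the exercise environment imports the class for you, so that import path is an assumption here. The `load_html_documents` wrapper is a hypothetical helper added so the snippet can be read and run on its own.

```python
from pathlib import Path


def load_html_documents(file_path: str):
    """Hypothetical helper: load an HTML file into LangChain Document objects."""
    # Assumption: langchain-community and unstructured are installed.
    # The import sits inside the function so it is only needed when called.
    from langchain_community.document_loaders import UnstructuredHTMLLoader

    # Create a document loader for unstructured HTML
    loader = UnstructuredHTMLLoader(file_path)
    # Load the document(s) into memory as a list of Document objects
    return loader.load()


# Only run the loader if the exercise file is actually present.
if Path("datacamp-blog.html").exists():
    data = load_html_documents("datacamp-blog.html")
    # Print the first document's content
    print(data[0].page_content)
    # Print the first document's metadata
    print(data[0].metadata)
```

In the exercise itself, the two blanks map to `UnstructuredHTMLLoader("datacamp-blog.html")` and `loader.load()`, and the two print statements use `data[0].page_content` and `data[0].metadata`. The metadata typically records where the document came from, such as the source file path.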