
Loading HTML files for RAG

It's possible to load documents from many different formats, including complex formats like HTML.

If you're not familiar with HTML, it's a markup language for creating web pages. Here's a small example:

<!DOCTYPE html>
<html>
<body>
  <h2>Heading</h2>
  <p>Here's some text and an image below:</p>
  <img src="image.jpg" alt="..." width="104" height="142">
</body>
</html>
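
To make the structure of that page concrete, here is a minimal sketch that pulls the visible text and image sources out of the example using Python's standard-library `html.parser`. The `TextExtractor` class and the `SAMPLE_HTML` name are hypothetical helpers introduced for illustration. This is not how `UnstructuredHTMLLoader` works internally; it only shows the kind of content an HTML loader has to separate from the markup.

```python
from html.parser import HTMLParser

# The example page from above, stored as a string (name is illustrative).
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<body>
  <h2>Heading</h2>
  <p>Here's some text and an image below:</p>
  <img src="image.jpg" alt="..." width="104" height="142">
</body>
</html>"""


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text and image sources."""

    def __init__(self):
        super().__init__()
        self.text_chunks = []
        self.image_sources = []

    def handle_starttag(self, tag, attrs):
        # Record the src attribute of every <img> tag.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_sources.append(src)

    def handle_data(self, data):
        # Keep only non-empty text, ignoring whitespace between tags.
        stripped = data.strip()
        if stripped:
            self.text_chunks.append(stripped)


extractor = TextExtractor()
extractor.feed(SAMPLE_HTML)
extractor.close()

print(extractor.text_chunks)
print(extractor.image_sources)
```

The tags themselves never reach `text_chunks`; only the heading and paragraph text do, which is roughly what you want to keep when preparing a page for retrieval.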

In this exercise, you'll load an HTML file containing a DataCamp blog post webpage. The necessary classes have already been imported for you.

This exercise is part of the course

Retrieval Augmented Generation (RAG) with LangChain


Exercise instructions

  • Use the UnstructuredHTMLLoader class to create a document loader for the datacamp-blog.html file in the current directory.
  • Load the documents into memory.
  • Print the first document's page content.
  • Print the first document's metadata.
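
The steps above can be sketched as follows. This is one possible shape under stated assumptions, not the official solution. The import path `langchain_community.document_loaders` is an assumption, since the exercise environment pre-imports the class. The function name `show_first_document` is hypothetical. Running it also requires the `unstructured` package and the datacamp-blog.html file to be present.

```python
def show_first_document(path):
    """Load an HTML file and print the first document's content and metadata.

    Hypothetical helper; the import sits inside the function so that
    defining it does not require LangChain to be installed.
    """
    # Assumed import path (the exercise pre-imports this class for you).
    from langchain_community.document_loaders import UnstructuredHTMLLoader

    # Create a document loader for the HTML file.
    loader = UnstructuredHTMLLoader(file_path=path)

    # Load the file into memory as a list of Document objects.
    data = loader.load()

    # Print the first document's text content, then its metadata.
    print(data[0].page_content)
    print(data[0].metadata)

    return data
```

Calling `show_first_document("datacamp-blog.html")` from the directory that contains the file would run all four steps in order.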

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Create a document loader for unstructured HTML
loader = ____

# Load the document
data = ____

# Print the first document's content
print(____)

# Print the first document's metadata
print(____)