Extracting text with PyPDF
PyPDF lets us extract text from PDFs, making it easy to work with multi-page documents like policy files.
In this exercise, you’ll load the US_Employee_Policy.pdf file, extract its content page by page, and combine it into a single string, preparing the text for a question-answering pipeline.
This exercise is part of the course
Working with Hugging Face
Exercise instructions
- Import the required class from pypdf and use it to load the PDF file.
- Access each page and extract its content using the correct method.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
from pypdf import ____
# Load the PDF file
reader = ____("US_Employee_Policy.pdf")
# Extract text from all pages
document_text = ""
for page in reader.____:
    document_text += page.____()
print(document_text)