1. Document Q&A
Welcome back! In this video, we’ll explore question-answering models, and use them to have conversations with documents. Let’s dive in!
2. What is document question answering?
Document question answering, or Document QA, involves generating answers to questions about the contents of a document or text passage.
This task requires two inputs: a document, typically a PDF file, and a question. The document could be a research paper, contract, user manual, or similar text-based file. The question is a text string asking something specific, like "What is the total revenue of Q3?"
The answer is generated by analyzing the content and can be a direct quote or a paraphrased response.
3. Use cases for document Q&A
Document QA is widely used to automate data extraction and analysis across industries.
In the legal sector, it identifies clauses in contracts, such as termination terms.
In finance, it extracts key figures like revenue and expenses from reports.
In customer support, it retrieves answers to common questions from manuals or FAQs.
4. Automating HR queries with document Q&A
Next, let’s introduce our case study. Imagine you’re an ML engineer at a large company. Your HR team is overwhelmed with questions about holidays, notice periods, and other policies.
This information is stored in a multi-page document called US-Employee_Policy.pdf, making it time-consuming to find answers manually.
Using Hugging Face, we’ll build a system to retrieve specific answers from the document, saving HR hours and streamlining employee communication.
5. Extracting text with pypdf
To start, we'll load the PDF file using the lightweight pypdf library and its PdfReader class.
We load the file by specifying the path, US-Employee_Policy.pdf, and use the .pages attribute to access all pages. Using a loop, we iterate through each page and call the .extract_text() method to extract its content. The text from all pages is appended to a variable, combining them into a single string and preparing the PDF for processing.
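Here's a minimal sketch of that extraction step. The helper names join_page_text and load_pdf_text are hypothetical and chosen only for illustration; PdfReader, .pages, and .extract_text() are the pypdf pieces mentioned above. The sketch assumes pypdf is installed and that the PDF contains selectable text rather than scanned images.

```python
def join_page_text(pages):
    """Concatenate the text of every page into a single string."""
    document_text = ""
    for page in pages:
        # Fall back to an empty string if a page yields no text
        document_text += page.extract_text() or ""
    return document_text


def load_pdf_text(path):
    """Open a PDF with pypdf and return the combined text of all pages."""
    # Imported here so join_page_text can be used without pypdf installed
    from pypdf import PdfReader

    reader = PdfReader(path)
    return join_page_text(reader.pages)
```

Calling load_pdf_text("US-Employee_Policy.pdf") would then give us the full document as one string, ready to pass to a model.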
Now, let’s move on to creating our Q&A pipeline.
6. Creating a Q&A pipeline
For this, we've selected the question-answering task with a distilbert-base model, a lightweight and efficient choice for Q&A tasks.
We crafted a question and passed it to the pipeline, along with the extracted text as the context parameter.
The pipeline returned the correct answer: 1, confirming the policy allows one volunteer day annually.
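As a rough sketch, the pipeline step could look like the code below. The exact checkpoint isn't named in this video, so the model name distilbert-base-cased-distilled-squad is an assumption, and the function names build_qa_pipeline and answer_question are hypothetical. The pipeline call and its question and context arguments come from the Hugging Face transformers library.

```python
def build_qa_pipeline(model_name="distilbert-base-cased-distilled-squad"):
    """Create a question-answering pipeline (model name is an assumed default)."""
    # Imported here so answer_question can be used with any compatible callable
    from transformers import pipeline

    return pipeline(task="question-answering", model=model_name)


def answer_question(qa, question, context):
    """Run the pipeline and return only the answer text."""
    # The pipeline returns a dict with 'score', 'start', 'end' and 'answer'
    result = qa(question=question, context=context)
    return result["answer"]
```

With the extracted text in document_text, we might call answer_question(build_qa_pipeline(), "How many volunteer days are offered annually?", document_text). The wording of that question is just an example.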
7. Bringing it all together
In this video, we’ve combined all the pieces to automate document question-answering.
Using PdfReader from pypdf, we extracted text from the PDF document by iterating through .pages and calling .extract_text() on each page to build the document_text string.
Next, we set up a pipeline with task="question-answering", passed a question, and provided the extracted text as the context. The result? Accurate answers to specific questions from the document.
Finally, as a next step, we can wrap this pipeline into reusable functions so that employees can ask their questions directly, freeing up the HR team to focus on building an amazing company culture!
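One possible way to do that wrapping is sketched below. The name make_document_qa is hypothetical, and the qa argument is assumed to be a question-answering pipeline (or any callable that accepts question and context and returns a dict with an 'answer' key).

```python
def make_document_qa(qa, document_text):
    """Return a function that answers questions about one fixed document."""

    def ask(question):
        # Every question is answered against the same extracted text
        result = qa(question=question, context=document_text)
        return result["answer"]

    return ask
```

We would build the function once, for example ask_policy = make_document_qa(qa, document_text), and then call ask_policy with each new question, so the PDF only needs to be read a single time.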
8. Let's practice!
We’ve covered so much—time to put it into practice. Keep moving forward!