PDF text extraction and chunking
This exercise is part of the course
End-to-End RAG with Weaviate
Exercise instructions
- Run the code provided to process the PDF documents using `docling` and parse them as markdown files.
- Define a `get_chunks_by_length_with_overlap()` function to chunk `md_txt` using a `500` character chunk length and `100` character overlap.
- Define a `get_chunks_using_markers()` function to chunk `md_text_1` by splitting on non-title headings (`"\n##"`).
- Apply the `get_chunks_using_markers()` function to `md_text_2` and compare the results to `md_text_1`.
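The two chunking functions named in the instructions could be sketched as follows. This is a minimal illustration, not the course's reference solution: the parameter names (`text`, `chunk_size`, `overlap`, `marker`), the default values, and the `sample_md` string are assumptions made for the example, and the `docling` conversion step is left out so the block stays self-contained.

```python
def get_chunks_by_length_with_overlap(text, chunk_size=500, overlap=100):
    """Split text into fixed-length chunks that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if not text:
        return []
    # Each new chunk starts `step` characters after the previous one,
    # so consecutive chunks share exactly `overlap` characters.
    step = chunk_size - overlap
    # Stop early so the last chunk is never fully contained in the previous one.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


def get_chunks_using_markers(text, marker="\n##"):
    """Split text at each occurrence of `marker`, keeping the heading in its chunk."""
    pieces = text.split(marker)
    # str.split drops the marker, so re-attach it to every piece after the first.
    chunks = [pieces[0]] + [marker + piece for piece in pieces[1:]]
    # Trim surrounding whitespace and drop empty chunks.
    return [chunk.strip() for chunk in chunks if chunk.strip()]


# Hypothetical sample standing in for the markdown produced by docling.
sample_md = "# Title\n\nIntro text.\n## Section A\nBody A.\n## Section B\nBody B."

length_chunks = get_chunks_by_length_with_overlap(sample_md, chunk_size=30, overlap=10)
marker_chunks = get_chunks_using_markers(sample_md)
```

Because `"\n##"` is also a prefix of `"\n###"`, this version splits on deeper headings as well. Comparing the chunks produced for two differently structured documents shows how much marker-based chunking depends on consistent heading levels.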
Note: If you’re running DataLab in Restricted Mode, this exercise isn’t supported yet. We’re actively working on making it available in the future.