PDF text extraction and chunking

This exercise is part of the course

End-to-End RAG with Weaviate

Exercise instructions

  • Run the provided code to process the PDF documents with docling and parse them as Markdown.
  • Define a get_chunks_by_length_with_overlap() function to chunk md_txt using a 500-character chunk length and a 100-character overlap.
  • Define a get_chunks_using_markers() function to chunk md_text_1 by splitting on non-title headings ("\n##").
  • Apply the get_chunks_using_markers() function to md_text_2 and compare the resulting chunks with those from md_text_1.
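
The two chunking functions named in the instructions could be sketched as follows. This is a minimal sketch, not the course's reference solution: the parameter names (text, chunk_length, overlap, marker), the choice to keep the "\n##" marker with each section, and the sample strings are all assumptions made for illustration.

```python
def get_chunks_by_length_with_overlap(text, chunk_length=500, overlap=100):
    """Split text into fixed-size character chunks that share `overlap` characters."""
    if not 0 <= overlap < chunk_length:
        raise ValueError("overlap must be non-negative and smaller than chunk_length")
    step = chunk_length - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_length])
        if start + chunk_length >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def get_chunks_using_markers(text, marker="\n##"):
    """Split text at each marker, keeping the heading with the section it opens."""
    parts = text.split(marker)
    # str.split drops the marker, so it is re-attached here (an assumption).
    chunks = [parts[0]] + [marker + part for part in parts[1:]]
    return [chunk.strip() for chunk in chunks if chunk.strip()]


# Hypothetical stand-ins for the Markdown text produced from the PDFs.
sample_plain = "".join(str(i % 10) for i in range(1200))
sample_markdown = "# Title\nintro\n## A\nbody a\n## B\nbody b"

length_chunks = get_chunks_by_length_with_overlap(sample_plain, 500, 100)
marker_chunks = get_chunks_using_markers(sample_markdown)
```

Fixed-length chunks are predictable in size but can cut through sentences, while marker-based chunks follow the document's own sections and therefore depend on how consistently headings appear in each file. That dependence is why comparing the output for md_text_1 and md_text_2 is informative.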


Note: If you’re running DataLab in Restricted Mode, this exercise isn’t supported yet. We’re actively working on making it available in the future.

