Splitting semantically
All of the splitting strategies you've used up to this point have the same drawback: the split doesn't consider the context of the surrounding text, so context can easily be lost during splitting.
In this exercise, you'll create and apply a semantic text splitter, an experimental method that splits text based on semantic meaning rather than character or token counts. When the splitter detects that the meaning of the text has shifted past a set threshold, it performs a split at that point.
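To make the threshold idea concrete, here is a minimal sketch in plain Python. It is not the splitter used in this exercise: `toy_embed` is a hypothetical stand-in for a real embedding model (it just counts words), and `semantic_split`, `cosine_distance`, and the sample sentences are illustrative assumptions. The sketch starts a new chunk whenever two consecutive sentences are further apart than the threshold.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def semantic_split(sentences, threshold):
    """Start a new chunk when consecutive sentences differ by more than threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine_distance(toy_embed(prev), toy_embed(curr)) > threshold:
            chunks.append([curr])      # meaning shifted: open a new chunk
        else:
            chunks[-1].append(curr)    # still on topic: extend the current chunk
    return [" ".join(chunk) for chunk in chunks]


sentences = [
    "The cat sat on the mat.",
    "The cat slept on the mat.",
    "Stock prices fell sharply today.",
    "Stock prices rose sharply yesterday.",
]
chunks = semantic_split(sentences, threshold=0.8)
for chunk in chunks:
    print(chunk)
```

With these sample sentences the two cat sentences stay together and the two stock sentences stay together, because the only large jump in distance is between the second and third sentence. A real embedding model would capture meaning rather than shared words, but the splitting logic follows the same pattern.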
This exercise is part of the course
Retrieval Augmented Generation (RAG) with LangChain
Exercise instructions
- Instantiate the 'text-embedding-3-small' embedding model from OpenAI.
- Create a semantic text splitter that uses vector gradients to determine semantic similarity and uses 0.8 as the threshold at which to split.
- Split the document using the semantic splitter.
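One way to picture a gradient-based breakpoint is to look at how quickly the distance between neighbouring sentences changes, instead of at the distance itself. The sketch below is an assumption about the general idea, not the library's implementation: the `gradient` helper, the sample `distances`, and the `0.3` cutoff are all hypothetical values chosen for illustration.

```python
def gradient(values):
    """Finite-difference gradient: one-sided at the ends, central in the interior."""
    n = len(values)
    if n < 2:
        return [0.0] * n
    grads = [values[1] - values[0]]
    for i in range(1, n - 1):
        grads.append((values[i + 1] - values[i - 1]) / 2)
    grads.append(values[-1] - values[-2])
    return grads


# Hypothetical distances between consecutive sentence embeddings:
# distances[i] is the distance between sentence i and sentence i + 1.
distances = [0.10, 0.12, 0.90, 0.15, 0.11]

grads = gradient(distances)

# Flag positions where the distance is rising faster than an assumed cutoff.
breakpoints = [i for i, g in enumerate(grads) if g > 0.3]
print(breakpoints)
```

Here the only flagged position is the one just before the spike in distance, which is where the text appears to change topic. How a given splitter turns its threshold amount into a cutoff is implementation-specific, so check the library documentation before relying on a particular value.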
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Instantiate an OpenAI embeddings model
embedding_model = ____(api_key="", model='____')
# Create the semantic text splitter with desired parameters
semantic_splitter = ____(
    embeddings=____,
    breakpoint_threshold_type="____",
    breakpoint_threshold_amount=____,
)
# Split the document
chunks = ____
print(chunks[0])