Querying Unstructured Data: Hybrid Search

1. Querying Unstructured Data: Hybrid Search

Welcome back. One older way to search is keyword search. Keyword search relies on exact matches between the words in our query and the text we're looking for. This is great for finding text that contains highly specific terms.
Keyword search works by leveraging sparse vectors. Each vector has one element for every unique word in your text corpus, so its length equals the number of unique words. Any word that appears in the text is captured with a 1, and all other elements in the vector are 0. So when we look at these vectors, we will see that most, but not all, of the elements are 0. Since the vector is largely an array of zeros, we call it a sparse vector.
While keyword search is great when we need to track down particular words, it can miss text that uses synonyms to convey a similar meaning, which we might also want to retrieve. In contrast to keyword search, similarity search relies on dense embeddings that encode rich information about the meaning of the text.
There are benefits to leveraging both sparse and dense vectors in search. If we take both types of search as building blocks, we can construct a hybrid search that combines the two. By combining dense and sparse vector search, we can make sure we turn over every stone in finding the text chunks we need. This way, we don't miss text containing exact matches to our query, and we can also find semantically similar chunks that we might be interested in.
So when we gather our search results, we often want to go about it in a more exhaustive way. Because vector operations like search are relatively cheap, we can afford to collect all of the search results we might want to use downstream. But if we really want only the most relevant text, we can order the results by relevance. This is a technique known as re-ranking.
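To make the sparse vectors and the hybrid idea concrete, here is a minimal Python sketch. Everything in it is illustrative: the three text chunks, the three-dimensional "dense" embeddings (hand-picked numbers, not the output of a real embedding model), and the weighted-sum combination with a hypothetical `alpha` parameter are all assumptions. The transcript does not say how Cortex Search actually combines the two scores.

```python
import math

def build_vocabulary(texts):
    """Sorted list of the unique lowercase words in the corpus."""
    return sorted({word for text in texts for word in text.lower().split()})

def sparse_vector(text, vocabulary):
    """Binary vector: 1 where a vocabulary word appears in the text, else 0."""
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def hybrid_scores(keyword_scores, dense_scores, alpha=0.5):
    """Weighted sum of normalized keyword scores and dense scores (an assumed scheme)."""
    top = max(keyword_scores) or 1  # avoid dividing by zero when nothing matches
    return [alpha * (k / top) + (1 - alpha) * d
            for k, d in zip(keyword_scores, dense_scores)]

# Toy corpus, invented for illustration.
chunks = [
    "reset your password from the login page",
    "recover account credentials by email",
    "quarterly revenue grew in europe",
]
query = "reset password"

# Sparse side: count the exact words shared by the query and each chunk.
vocabulary = build_vocabulary(chunks)
query_sparse = sparse_vector(query, vocabulary)
keyword_scores = [dot(query_sparse, sparse_vector(c, vocabulary)) for c in chunks]

# Dense side: made-up embeddings standing in for a real embedding model.
query_dense = [0.9, 0.1, 0.0]
chunk_dense = [[0.8, 0.2, 0.0], [0.7, 0.3, 0.1], [0.0, 0.1, 0.9]]
dense_scores = [cosine(query_dense, e) for e in chunk_dense]

combined = hybrid_scores(keyword_scores, dense_scores)
ranking = sorted(range(len(chunks)), key=lambda i: combined[i], reverse=True)
for i in ranking:
    print(f"{combined[i]:.3f}  {chunks[i]}")
```

In this toy setup the second chunk shares no exact words with the query, so keyword search alone would skip it, but its made-up embedding sits close to the query's, so the hybrid score still surfaces it.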
A re-ranker is a type of model that, when given a query and document pair, outputs a similarity score. Cortex Search uses this to order the results based on similarity. Once we've scored each document based on its similarity to the query, we can sort by relevance and remove any irrelevant chunks from our search results. By combining these steps, keyword search, similarity search, and re-ranking, we gain a robust ability to query our data and get valuable, accurate information to answer our queries.
So let's talk about integrating these structures with LLMs. These structures give us pretty good search results, but we should talk about why we want these search results in the first place. If we ask an LLM a question on its own, it cannot directly query the information it needs to reliably answer our question. This is especially true if we are asking questions about our own private data. To give the LLM the tools it needs to answer accurately, we need to ground it with evidence. This is where the search results come in. After we've completed our search, or retrieval, the R in RAG, we can then augment the generation step with our search results. We'll talk about this in more detail later in the course.
So in this module, we covered a high-level overview of how we get answers from our unstructured data. It's important to understand the conceptual structure of how we search our unstructured data with hybrid search and re-ranking in order to build a highly performant search system. In the next video, we'll move back to discuss text-to-SQL in more detail. See you then!
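The re-ranking and augmentation steps described in this video can be sketched in a few lines of Python. This is only a sketch under stated assumptions: `toy_relevance` is a placeholder word-overlap scorer standing in for a real re-ranking model, and `rerank`, `build_prompt`, the `min_score` cutoff, and the prompt wording are all hypothetical names and choices, not part of Cortex Search.

```python
def toy_relevance(query, document):
    """Placeholder scorer: fraction of query words found in the document.
    A real re-ranker would be a trained model scoring the (query, document) pair."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words)

def rerank(query, candidates, score_fn, min_score=0.0):
    """Score every candidate, drop those at or below min_score, sort best first."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    kept = [(score, doc) for score, doc in scored if score > min_score]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)

def build_prompt(question, chunks):
    """Augment the question with retrieved chunks so the LLM is grounded in evidence."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )

# Candidates as they might come back from an exhaustive hybrid search (invented).
candidates = [
    "the login page has a reset link",
    "reset your password from the login page",
    "quarterly revenue grew in europe",
]
query = "reset password"

ranked = rerank(query, candidates, toy_relevance)
top_chunks = [doc for _, doc in ranked]
prompt = build_prompt("How do I reset my password?", top_chunks)
print(prompt)
```

The resulting prompt is what would be passed to the LLM in the generation step. The unrelated revenue chunk scores zero and is filtered out before it reaches the prompt.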

2. Let's practice!
