The text-to-query workflow

1. The text-to-query workflow

Welcome back! With your cluster set up, let's look at the text-to-query workflow in a bit more detail.

2. Text-to-query

Given a natural language user input, a text-to-query application first needs to identify which collection has the most relevant information to answer the question, since typically, your database will have multiple collections.

3. Text-to-query

Then, it needs to determine the schema of the documents in the identified collection.

4. Text-to-query

Using this information, the application can translate the natural language user input into a valid MongoDB query.

5. Text-to-query

It's good practice to also validate the generated query before executing it.

6. Text-to-query

And finally, the query is executed against the identified collection to return a result.

7. Text-to-query

The result is then passed to an LLM along with the user input to generate a context-aware response. In this chapter, we'll write the MongoDB query ourselves to get a feel for the syntax, use it to query the database, and insert the query results into an LLM prompt to answer a question.

8. Text-to-query: the agentic approach

In the next chapter, we'll use an AI agent to dynamically perform these tasks by using a set of tools to read and query information from the database.

9. Data: the MongoDB way!

In MongoDB, data within a database is organized in collections. Collections are to MongoDB what tables are to relational databases. Collections consist of flexible JSON-like documents that can support varying key-value pairs or even nested structures, instead of rows with fixed columns like you'd have in a relational database.

10. MongoDB queries

In MongoDB, queries are expressed using your programming language of choice. Instead of writing SQL statements, you pass JSON-like objects as parameters to methods such as .find(), .insert(), .update(), and .delete(). This method-based approach is why it's called a Query API, rather than a query language.

11. Example: movies collection

Let's look at some common query operations using the MongoDB Query API on the pre-built movies database that was loaded into our MongoDB cluster at creation time. The database consists of several collections, but let's pick the movies collection and run some queries against it. This collection consists of movie information, such as release year, cast, genres, etc. Let's start with filtering the data for romance movies.

12. Single filters

To write this MongoDB query, we create a Python dictionary where the field name 'genres' is the key, and we map it directly to the value 'Romance' to find movies that have Romance as one of their genres.

13. Single filters

Let's level this up with multiple filters! Say we want to find romance movies from the 90s.

14. Multiple filters

To filter on multiple conditions, we use the $and operator, which houses a list of conditions expressed as dictionaries, as before. Here we use the $gte and $lt operators to check that the year is greater-than-or-equal to 1990 and less-than 2000.

15. Multiple filters

We can extend this list to add more conditions, or use $or instead of $and to indicate that only one condition needs to be met in the filtering.

16. Aggregation pipelines

You can also perform complex data manipulations, such as grouping, sorting, transforming, and joining data across collections using the MongoDB Query API. This is done using aggregation pipelines. Aggregation pipelines in MongoDB consist of a series of stages, each transforming the data in some way and passing the results on to the next stage.

17. Multiple operations

Let's say we want to retrieve the top 5 latest movie releases from the dataset. To do this, we need to first sort by the released field in descending order, then limit the results to the top five, and finally, project only the title field. We start by opening a list containing the three operations,

18. Multiple operations

then use the $sort key to indicate we want to start with a sort operation. We specify the column to sort on, and the value -1 indicates it should be sorted in descending order.

19. Multiple operations

We use $limit to limit only to the top five results.

20. Multiple operations

Finally, we use $project to keep only the title field. The 1's and 0's indicate what will be kept in the final result. Here, we remove the default "_id" column by setting it to zero.

21. Multiple operations

Let's tackle one last query: a grouped aggregation.

22. Grouped aggregations

Let's say we wanted to find how many movies of each genre are present in our dataset. Since genres is an array, we first need to unwind it.

23. Grouped aggregations

And we do that using the $unwind operator. This will create a separate document for each element in the genres array. Next, we use $group to indicate we want to group on the genres field, and use the $sum accumulation operator to aggregate the total number per genre. This aggregation will be captured in a numMovies field. We can also apply a filter to the aggregated field using the $match keyword. We use $gte to indicate a greater-than-or-equal-to 50 filter.

24. Grouped aggregations

This is what the result of this aggregation would be. We have genres and their movie counts, and we've only retained genres that have more than 50 movies. The MongoDB query API supports lots of different transformations, but as we'll eventually be letting the LLM build the queries for us, we'll leave the rest to another day.

25. Let's practice!

Time to get querying!

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.