Failing Gracefully: Mitigating Risks

1. Failing Gracefully: Mitigating Risks

Unfortunately, even if you followed every detail of this course and successfully put an agent into production, failures will still happen. Let's look at the most common failure points and their solutions, starting with tool call failures.

2. Tool call failures

Tool call failures happen when the model sends incorrect parameters to a tool, misinterprets the tool's output, or runs into general integration problems such as authentication errors or service downtime. Three steps help mitigate these errors: defining clear parameters and usage guidelines for each tool, implementing validation checks on tool outputs, and adding a verification layer to confirm that the correct tool was selected. Protocols like the Model Context Protocol (MCP) also reduce the risk of incorrect tool calls, provided they are configured correctly.
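To make the idea of clear parameter definitions concrete, here is a minimal sketch of a check that runs before a tool is called. The `get_weather` tool, its schema, and the `validate_args` helper are all hypothetical names invented for this illustration; a production agent would more likely rely on a schema library or on the validation built into its framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParamSpec:
    """Expected type of one tool parameter and whether it is required."""
    type_: type
    required: bool = True


# Hypothetical schema for a hypothetical `get_weather` tool.
GET_WEATHER_SCHEMA = {
    "city": ParamSpec(str),
    "units": ParamSpec(str, required=False),
}


def validate_args(schema, args):
    """Return a list of problems; an empty list means the call looks valid."""
    errors = []
    for name, spec in schema.items():
        if name not in args:
            if spec.required:
                errors.append(f"missing required parameter: {name}")
        elif not isinstance(args[name], spec.type_):
            errors.append(
                f"{name} should be {spec.type_.__name__}, "
                f"got {type(args[name]).__name__}"
            )
    # Reject parameters the tool does not know about.
    for name in args:
        if name not in schema:
            errors.append(f"unexpected parameter: {name}")
    return errors
```

If the list comes back non-empty, the errors can be passed back to the model so it can correct the call instead of sending a bad request to the tool.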

3. Tool call failures - retry mechanism

Some tools, particularly ones that call external APIs, may encounter periods where the service is busy and requests take longer than usual or fail outright. To limit the disruption to our agent, we can use a retry mechanism on tool calls with a backoff strategy. When a request to the tool fails, the agent retries at a gradually decreasing rate, waiting longer between each attempt. This should eventually return the expected tool output without overloading the system we're trying to reach. In the meantime, we can include a callback to inform learners that the service is experiencing heightened traffic, and to expect a short delay.
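A retry with backoff could look something like the sketch below. It uses exponential backoff with a little random jitter, which is one common choice rather than the only one. `ServiceBusyError`, `call_with_backoff`, and the `on_retry` callback are hypothetical names chosen for this example, and the delay values are placeholders.

```python
import random
import time


class ServiceBusyError(Exception):
    """Hypothetical error a tool raises when the upstream service is overloaded."""


def call_with_backoff(tool, *, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      on_retry=None, sleep=time.sleep):
    """Call `tool()` and retry with exponentially growing waits when it is busy."""
    for attempt in range(1, max_attempts + 1):
        try:
            return tool()
        except ServiceBusyError:
            if attempt == max_attempts:
                raise  # Out of attempts: let the caller handle the failure.
            # The wait doubles on each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Up to 10% jitter so many clients do not retry in lockstep.
            delay += random.uniform(0, delay * 0.1)
            if on_retry is not None:
                on_retry(attempt, delay)  # e.g. tell the user to expect a delay
            sleep(delay)
```

The `on_retry` callback is where the "expect a short delay" message would be sent. Passing `sleep` as a parameter is only there to make the function easy to test without real waiting.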

4. Tool call failures - caching

Another option is caching: responses from tools are stored and used as a fallback if the service becomes unavailable. This works especially well for static data that rarely changes, such as historical data or documentation. However, for more dynamic cases, such as retrieving real-time stock prices or weather information, a cached response may already be out of date, so caching won't solve our problem and we'll have to stick with the retry mechanism.
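As a rough illustration of caching as a fallback, the sketch below wraps a tool, stores each successful response, and serves the stored copy only when the live call fails and the copy is still fresh enough. `CachedTool` and `max_age_seconds` are hypothetical names, and treating `ConnectionError` as the "service unavailable" signal is an assumption made for this example.

```python
import time


class CachedTool:
    """Wrap a tool so recent responses can be served when the service is down."""

    def __init__(self, tool, max_age_seconds, clock=time.monotonic):
        self._tool = tool
        self._max_age = max_age_seconds
        self._clock = clock
        self._cache = {}  # key -> (time stored, value)

    def __call__(self, key):
        try:
            value = self._tool(key)
        except ConnectionError:
            # Service unavailable: fall back to the cache if the entry is fresh.
            if key in self._cache:
                stored_at, cached = self._cache[key]
                if self._clock() - stored_at <= self._max_age:
                    return cached
            raise  # Nothing usable in the cache, so surface the failure.
        self._cache[key] = (self._clock(), value)
        return value
```

A long `max_age_seconds` suits documentation or historical data. For real-time prices or weather, the value would have to be so short that the cache rarely helps, which is the limitation described above.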

5. Tool call failures - queue management

Queue management is also important for mitigating tool call failures. Some calls depend on each other. Think back to the corporate travel example: the airport transport agent was completely reliant on the results of the flight booking agent. In these cases, we need to set up an intelligent queuing system so that each agent is triggered as soon as the results it depends on are available. Additionally, when an API is busy, instead of retrying it immediately, we can program the agent to back off and move that call down the queue until other operations have been performed; that way, the busy call isn't holding them back.
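Both queueing ideas, waiting on dependencies and moving busy calls to the back, can be sketched with a simple in-memory queue. This is a simplified single-threaded illustration, not a production scheduler: `run_queue`, `ServiceBusy`, and the task format are hypothetical, and a real system would also wait between retries of a busy call.

```python
from collections import deque


class ServiceBusy(Exception):
    """Hypothetical error raised when a tool's API is temporarily overloaded."""


def run_queue(tasks, max_busy_retries=3):
    """Run tasks given as {name: (set of dependency names, fn(results))}."""
    queue = deque(tasks)
    results = {}
    busy_counts = {name: 0 for name in tasks}
    waiting_streak = 0  # consecutive tasks skipped for missing dependencies
    while queue:
        name = queue.popleft()
        depends_on, fn = tasks[name]
        if not all(dep in results for dep in depends_on):
            queue.append(name)  # not ready yet, try again later
            waiting_streak += 1
            if waiting_streak > len(queue):
                raise RuntimeError(f"unsatisfiable dependencies: {sorted(queue)}")
            continue
        waiting_streak = 0
        try:
            results[name] = fn(results)
        except ServiceBusy:
            busy_counts[name] += 1
            if busy_counts[name] > max_busy_retries:
                raise
            queue.append(name)  # back of the queue so other work proceeds
    return results
```

In the travel example, the airport transport task would list the flight booking as a dependency, so it runs only once the flight result exists, while a busy hotel API would simply be revisited after the other bookings finish.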

6. Authentication

Tools and data may require specific permissions before they can be used. Without those permissions, our agents may be unable to access the information or perform the actions that users request. Granting too much access creates the opposite problem. For example, our IT support agent may need the logs of the user's device to troubleshoot their issues, but if it is given open access to the device, it may leak sensitive data like the user's current location or personal account data. Here are some strategies to mitigate these risks:

7. Authentication - unique agent identifiers

Using unique agent identifiers: This assigns a distinct identity to each agent, which allows us to create granular roles, specific permissions, and a clear audit trail of each agent's actions.
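One way to picture unique agent identifiers is a small record per agent that carries its ID and permissions, plus a log entry for every authorization decision. This is a minimal sketch with hypothetical names (`AgentIdentity`, `authorize`, `audit_log`); in practice, identities and permissions would usually live in an identity provider rather than in application code.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AgentIdentity:
    """A distinct identity with its own set of permitted actions."""
    name: str
    permissions: frozenset
    agent_id: str = field(default_factory=lambda: str(uuid.uuid4()))


audit_log = []  # in-memory stand-in for a real audit store


def authorize(agent, action):
    """Check one action against the agent's permissions and record the decision."""
    allowed = action in agent.permissions
    audit_log.append({
        "agent_id": agent.agent_id,
        "agent": agent.name,
        "action": action,
        "allowed": allowed,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return allowed
```

Because every entry carries the agent's ID, both allowed and denied requests can later be traced back to the specific agent that made them.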

8. Authentication - isolated environments

Creating isolated environments: This restricts the agent's interactions to only the specific data elements or systems it is intended to access.
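Full isolation is usually handled at the infrastructure level, for example with sandboxes or separate credentials, but the data side of the idea can be sketched in a few lines. The hypothetical `scoped_view` helper below hands the agent a read-only view containing only the fields it is meant to see; the field names are made up for this illustration.

```python
from types import MappingProxyType


def scoped_view(record, allowed_fields):
    """Return a read-only view of `record` limited to the allowed fields.

    The copy is shallow, so nested values such as lists are still shared.
    """
    visible = {k: v for k, v in record.items() if k in allowed_fields}
    return MappingProxyType(visible)
```

An IT support agent given `scoped_view(device, {"os_version", "error_log"})` could read the logs it needs while never receiving the device's location or account details.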

9. Authentication - guardrails and action constraints

Setting clear guardrails and action constraints: This defines strict limits on what an agent can do. For example, an IT support agent handling a password reset shouldn't be able to initiate a full account deletion.
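The password reset example can be expressed as an allowlist that is checked before any action runs. This is a sketch under assumed names: `TASK_ALLOWED_ACTIONS`, `run_action`, and the action names are hypothetical and would be replaced by whatever the real support system exposes.

```python
class ActionNotAllowed(Exception):
    """Raised when an agent attempts an action outside the current task's limits."""


# Hypothetical allowlist: the only actions permitted for each task.
TASK_ALLOWED_ACTIONS = {
    "password_reset": frozenset({"verify_identity", "send_reset_link"}),
}


def run_action(task, action, handlers):
    """Run `action` only if the allowlist for `task` includes it."""
    allowed = TASK_ALLOWED_ACTIONS.get(task, frozenset())
    if action not in allowed:
        raise ActionNotAllowed(f"{action!r} is not permitted during {task!r}")
    return handlers[action]()
```

Using an allowlist rather than a blocklist means any action not explicitly granted, including account deletion, is refused by default.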

10. Let's practice!

Now let's apply these concepts by completing the last set of exercises!
