Designing fault-tolerant and resilient applications on AWS

1. Designing fault-tolerant and resilient applications on AWS

Welcome back! Now let's dive into how to design fault-tolerant and resilient applications.

2. Building for resilience

As you build applications that use AWS services, you will need to understand how to architect for failure and design applications that can handle failure gracefully.

3. Architect for failure

Resilience begins with a simple assumption: everything will fail.

4. Temporary failures

To architect for resilience, you first need to understand what might go wrong. Here are the common failure modes: Temporary errors are failures that go away on their own. These are usually safe to retry.

5. Timeouts

Timeouts occur when external services take too long to respond. A timeout does not always mean the request failed on the other end, just that you stopped waiting.

6. Permanent errors

Permanent errors are failures where the request is fundamentally broken. Retrying will not fix these.

7. API limits

You exceed an API rate limit when you send more requests than the API allows in a given period, often measured in requests per second. The HTTP 429 status code means "Too Many Requests."

8. Retry strategies

To mitigate temporary errors, add retry logic. Failures in distributed systems are often temporary. A network timeout or a throttled request may succeed if attempted again after a short delay. Think carefully when adding retry logic, though. Blindly retrying can make issues worse by adding load to a system that is already struggling. Strategies that help include implementing exponential backoff to increase the delay between retries, using jitter to add randomness and avoid retry spikes, and using retry limits to prevent infinite loops.
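The three strategies above can be sketched together in a few lines of Python. This is a minimal illustration, not a production implementation: `TransientError`, `backoff_delay`, and `call_with_retries` are hypothetical names, and the base delay, cap, and attempt limit are assumed values.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def backoff_delay(attempt, base=0.1, cap=5.0):
    """Exponential backoff with full jitter: a random delay between 0 and
    the capped exponential value for this attempt."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retries(operation, max_attempts=4, sleep=time.sleep):
    """Call operation(), retrying TransientError until max_attempts is reached."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retry limit reached: surface the failure
            sleep(backoff_delay(attempt))
```

The `sleep` parameter is injectable only so the delays can be observed or skipped in tests; in normal use the default `time.sleep` applies.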

9. AWS SDK native capabilities

AWS SDKs handle much of this automatically, which is why using them correctly is critical. AWS SDKs include built-in retry logic with exponential backoff and jitter, meaning they automatically retry failed or throttled requests after increasing delays. This reduces the amount of custom error-handling code you need to write.
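As one example, the Python SDK (boto3) lets you adjust the built-in retry behavior through a client configuration object. The sketch below assumes boto3 is installed; the service, Region, and numeric values are placeholders chosen for illustration.

```python
import boto3
from botocore.config import Config

# Assumed values for illustration; tune them for your workload.
retry_config = Config(
    retries={
        "max_attempts": 5,   # upper bound on the SDK's retry attempts
        "mode": "standard",  # other modes: "legacy" and "adaptive"
    }
)

# Placeholder service and Region. Creating the client makes no API call.
s3 = boto3.client("s3", region_name="us-east-1", config=retry_config)
```

Every call made through this client then uses the configured retry behavior, so application code does not need its own retry loop for these requests.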

10. Managing retry logic

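Retry logic works well for some errors, but it will not be effective in the following scenarios. First, most HTTP 4xx errors: these indicate a problem with the request, not the server. The server understood the request and rejected it, so sending the same request again produces the same result. Throttling responses such as 429 are the exception, because they can succeed after a delay. Second, requests that are not idempotent: retrying them can apply the same change twice and lead to inconsistent behavior.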
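A retry decision along these lines can be written as a small function. This is a sketch under stated assumptions: `should_retry` is a hypothetical helper, it treats 429 as retryable in line with the earlier slide on API limits, and it assumes a throttled request was rejected before any work was done, which is typical but not guaranteed.

```python
def should_retry(status_code, idempotent):
    """Decide whether repeating a request is worthwhile and safe."""
    if status_code == 429:
        return True  # throttled: assumed rejected before processing
    if 400 <= status_code < 500:
        return False  # the request itself must change before it can succeed
    if status_code >= 500:
        return idempotent  # server-side failure: retry only if repeating is safe
    return False  # success and redirect responses need no retry
```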

11. Managing timeouts

Retrying a failing request without limits can amplify problems, especially under high load. This is why retries are typically combined with timeouts, ensuring that requests do not wait indefinitely for a response. A timeout is a maximum amount of time you are willing to wait for a response. Every call to an external service should have a timeout.
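One way to put a hard limit on how long you wait is sketched below, using only the Python standard library. `call_with_timeout` is a hypothetical helper, and the approach (running the call in a worker thread) is just one option among several.

```python
import concurrent.futures
import time


def call_with_timeout(operation, timeout_seconds):
    """Wait at most timeout_seconds for operation(); raise TimeoutError otherwise."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(operation)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        # We stopped waiting, but the work may still complete in the background.
        raise TimeoutError(f"no response within {timeout_seconds}s") from None
    finally:
        executor.shutdown(wait=False)  # do not block on the slow call
```

For network calls, the client's own timeout setting is usually the better choice, for example `urllib.request.urlopen(url, timeout=5)`, because it also releases the underlying connection.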

12. Circuit breakers

When a service is consistently failing, retries alone are not enough. Continuing to send requests can overload the failing service and degrade the entire system. The circuit breaker pattern addresses this by temporarily stopping requests to an unhealthy dependency. In the closed state, requests flow normally.

13. Circuit breakers

If failures exceed a threshold, the circuit opens and subsequent requests are blocked.

14. Circuit breakers

After a cooldown period, the circuit enters a half-open state to test recovery. If the service responds successfully, the circuit closes and normal operation resumes. If not, the circuit opens again. While AWS SDKs primarily focus on retries and backoff, circuit breaker behavior is typically implemented at the application level or through libraries and service meshes.
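The three states can be sketched as a small application-level class. This is a simplified, single-threaded illustration: `CircuitBreaker` and `CircuitOpenError` are hypothetical names, the threshold and cooldown are assumed values, and a production version would also need locking and metrics.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker blocks a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown:
            return "half-open"
        return "open"

    def call(self, operation):
        if self.state == "open":
            raise CircuitOpenError("dependency marked unhealthy")
        try:
            result = operation()  # closed: normal call; half-open: trial call
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or reopen after a failed trial
            raise
        self.failures = 0
        self.opened_at = None  # a success closes the circuit
        return result
```

Callers wrap each request to the dependency in `breaker.call(...)` and treat `CircuitOpenError` as a signal to fall back or fail fast instead of waiting on a service that is known to be unhealthy.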

15. Dead letter queues

When a message continues to fail after multiple retry attempts, it can hold up the messages behind it or keep triggering the same failure. It is often better to move it out of the main processing flow. This is where dead letter queues (DLQs) are used. Dead letter queues are an important part of building resilient applications because they provide a controlled way to handle persistent failures. Several AWS services implement dead letter queues, each in its own way, so make sure you understand the trade-offs and constraints of the one you use.
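The idea can be illustrated with an in-memory sketch. `process_with_dlq` is a hypothetical function and both queues here are plain Python collections; in a managed service such as Amazon SQS, the equivalent threshold is set through the queue's redrive policy (`maxReceiveCount`) rather than in application code.

```python
from collections import deque


def process_with_dlq(messages, handler, max_receive_count=3):
    """Process messages; move any that fail max_receive_count times to a DLQ."""
    main_queue = deque((message, 0) for message in messages)
    dead_letters = []
    while main_queue:
        message, receive_count = main_queue.popleft()
        try:
            handler(message)
        except Exception:
            receive_count += 1
            if receive_count >= max_receive_count:
                dead_letters.append(message)  # set aside for later inspection
            else:
                main_queue.append((message, receive_count))  # try again later
    return dead_letters
```

The failing message is attempted a bounded number of times and then set aside, so healthy messages keep flowing and the failed one remains available for debugging or replay.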

16. AWS API limits

When building applications that interact with AWS services, you are accessing APIs. Each AWS service defines its own limits on how frequently you can make API calls. If your application exceeds these limits, requests may be throttled, typically returning errors such as `429 Too Many Requests`. As a result, your application must be designed to handle throttling gracefully.
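Handling throttling starts with recognizing it. The sketch below assumes an error dictionary shaped like the `response` attribute of a botocore `ClientError`; `is_throttled` is a hypothetical helper, and the list of error codes is illustrative rather than exhaustive, since each service names its throttling errors differently.

```python
# Illustrative, non-exhaustive set of throttling error codes.
THROTTLING_CODES = {
    "Throttling",
    "ThrottlingException",
    "TooManyRequestsException",
    "RequestLimitExceeded",
    "ProvisionedThroughputExceededException",
}


def is_throttled(error_response):
    """Return True if the error looks like a throttling response."""
    code = error_response.get("Error", {}).get("Code", "")
    status = error_response.get("ResponseMetadata", {}).get("HTTPStatusCode")
    return code in THROTTLING_CODES or status == 429
```

A caller that detects throttling would typically back off and retry later instead of treating the error as a permanent failure.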

17. Integrating with third-party services

Third-party services introduce additional uncertainty because you have no control over their performance or availability. You need to handle those situations gracefully:
- Set timeouts to prevent long waits
- Use retries with backoff for temporary issues
- Isolate dependencies so failures do not spread

18. Integrating with third parties

Consider a shift from synchronous to asynchronous communication, allowing your system to continue processing even if the external service is delayed.
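A minimal sketch of that shift, using an in-process queue and a background worker: the caller enqueues work and moves on, while the worker talks to the external service at its own pace. `start_worker` and `send_to_third_party` are hypothetical names; on AWS, a managed queue would usually take the place of `queue.Queue`.

```python
import queue
import threading


def start_worker(jobs, send_to_third_party, results):
    """Start a background thread that drains jobs and records each outcome."""
    def run():
        while True:
            job = jobs.get()
            try:
                if job is None:  # sentinel value: stop the worker
                    return
                results.append(send_to_third_party(job))
            except Exception as error:
                results.append(error)  # record the failure, keep processing
            finally:
                jobs.task_done()

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker


jobs = queue.Queue()
results = []
start_worker(jobs, lambda job: f"sent {job}", results)
jobs.put("order-1")  # returns immediately; the caller is not blocked
jobs.put(None)       # ask the worker to stop
jobs.join()          # only needed here so the example can inspect results
```

A slow or failing external call now delays only the worker, not the code path that accepted the request.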

19. Let's practice!

Now let's practice what we've learned about building resilient applications!
