Troubleshooting deployed applications

1. Troubleshooting deployed applications

Welcome back. You can now see what your application is doing; next, let's fix it when it breaks. In this video, you'll learn to troubleshoot deployed applications. We'll cover the most common failure signatures for AWS Lambda functions (code that runs without you managing servers), how to query logs with CloudWatch Logs Insights, and how to debug service integration failures by reading each invocation from both sides. Let's get started.

2. Task timed out

Your phone buzzes: a function is failing, but only sometimes. The log says Task timed out after three seconds. Raise the timeout? Add memory? Add retries? Guess wrong and you make it worse. Most Lambda failures have a signature that points straight at the cause.

3. The common Lambda failure signatures

Most Lambda failures fall into a handful of recognizable signatures. A timeout shows Task timed out after N seconds, meaning the function exceeded its configured limit, often while waiting on a slow downstream call. A throttle shows TooManyRequestsException, meaning you hit a concurrency limit; concurrency is how many copies of the function run at once. An IAM denial shows AccessDenied or AccessDeniedException, meaning the execution role is missing a permission. Running out of memory typically shows up as the runtime being killed mid-execution, for example Runtime exited with error: signal: killed. Read the error message first, because the signature usually points straight at the cause.
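To make the signature-matching habit concrete, here is a minimal Python sketch that maps a log line to a likely cause. The `classify` helper and the `SIGNATURES` table are hypothetical, and the patterns are assumptions based on the messages described above (with `signal: killed` standing in for a runtime killed mid-execution); check them against what your own functions actually log.

```python
import re

# Hypothetical table: log-text pattern -> likely cause.
# Patterns are illustrative assumptions; verify them against your own logs.
SIGNATURES = [
    (re.compile(r"Task timed out after [\d.]+ seconds"), "timeout: check slow downstream calls"),
    (re.compile(r"TooManyRequestsException"), "throttle: concurrency limit reached"),
    (re.compile(r"AccessDenied"), "IAM denial: execution role is missing a permission"),
    (re.compile(r"signal: killed"), "out of memory: runtime killed mid-execution"),
]


def classify(log_line: str) -> str:
    """Return the diagnosis for the first matching signature, else 'unknown'."""
    for pattern, diagnosis in SIGNATURES:
        if pattern.search(log_line):  # substring search, so AccessDeniedException matches too
            return diagnosis
    return "unknown: read the full error message"


print(classify("Task timed out after 3.00 seconds"))  # prints "timeout: check slow downstream calls"
```

The point is not the regexes themselves but the order of work: classify first, and only then decide which setting, if any, to change.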

4. First response to a failing function

When a function starts failing, resist changing settings at random. First, read the actual error message and match it to a known signature. A timeout often means a downstream call got slow, so you check that dependency's latency rather than bumping the timeout blindly. The disciplined loop is: read the error, form one hypothesis, change one thing, re-test. Change several things at once and, even if it works, you won't know which one fixed it.
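One way to check a dependency's latency instead of guessing is to time each downstream call and log the result. The sketch below assumes a hypothetical `timed_call` wrapper that emits one JSON line per call; in a real function you would wrap the actual client call, such as a database write, rather than the simulated sleep used here.

```python
import json
import time
from typing import Any, Callable


def timed_call(dependency: str, fn: Callable[..., Any], *args: Any,
               log: Callable[[str], None] = print, **kwargs: Any) -> Any:
    """Call fn(*args, **kwargs) and log one JSON line with its latency.

    Hypothetical helper: wrapping each downstream call lets you trace a
    timeout to the dependency that got slow instead of guessing.
    """
    start = time.perf_counter()
    ok = False
    try:
        result = fn(*args, **kwargs)
        ok = True
        return result
    finally:
        # Runs on success and on exceptions, so failed calls are logged too.
        latency_ms = (time.perf_counter() - start) * 1000
        log(json.dumps({"dependency": dependency,
                        "latency_ms": round(latency_ms, 1),
                        "ok": ok}))


# Example: time a simulated slow dependency (a 50 ms sleep).
timed_call("orders-table", time.sleep, 0.05)
```

Logging JSON has a side benefit: Logs Insights automatically discovers fields in JSON log events, so a field like `latency_ms` can be queried directly later.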

5. CloudWatch Logs Insights

When logs get large, scrolling doesn't scale, so you query them. CloudWatch Logs Insights lets you run queries, written in its own query language, against your log groups. You use the filter command to narrow to the events you care about, like only errors, and the stats command to aggregate, like a count per minute. So a single query can return the per-minute error count over the last hour, or group failures by status code. This turns a vague "errors went up sometime this afternoon" into a precise answer about exactly when and how much. It's one of the highest-leverage troubleshooting skills on AWS.
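Here is a sketch of the per-minute error count described above. The query string uses the Logs Insights `filter` and `stats ... by bin(1m)` commands; the Python function is a local stand-in written for illustration, applying the same filter-then-aggregate steps to sample events so you can see what the query computes. It assumes error lines contain the literal text `ERROR`; adjust that to your own log format.

```python
from collections import Counter
from datetime import datetime

# A Logs Insights query of this shape counts ERROR lines per minute.
QUERY = """
fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) as errors by bin(1m)
"""


def errors_per_minute(events: list[tuple[datetime, str]]) -> dict[datetime, int]:
    """Local stand-in for `filter` followed by `stats count(*) by bin(1m)`."""
    counts: Counter = Counter()
    for ts, message in events:
        if "ERROR" in message:                          # filter step
            minute = ts.replace(second=0, microsecond=0)  # bin(1m) step
            counts[minute] += 1                           # count(*) step
    return dict(sorted(counts.items()))


sample = [
    (datetime(2024, 5, 1, 14, 0, 5), "ERROR write failed"),
    (datetime(2024, 5, 1, 14, 0, 40), "INFO ok"),
    (datetime(2024, 5, 1, 14, 1, 2), "ERROR write failed"),
]
print(errors_per_minute(sample))
```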

6. Debugging integration failures from both sides

Some of the hardest failures aren't inside one service; they're between two, like a Lambda function that can't write to DynamoDB. The mistake is looking at only one side. Instead, you read the caller's log and the callee's log together, lining them up by a shared request ID or trace ID. Often the caller logs that it sent a request and the callee never logs receiving it, which tells you the break is somewhere in between, perhaps a missing permission or a blocked network path. Reading both sides is how you find integration bugs fast.
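Once both logs carry a shared ID, the both-sides check reduces to a set difference. This minimal sketch assumes each log entry is a dict with a hypothetical `request_id` field; in practice that might be a request ID you propagate yourself or a trace ID from a tracing tool.

```python
def find_broken_requests(caller_log: list[dict], callee_log: list[dict]) -> list[str]:
    """Request IDs the caller logged as sent that the callee never logged as received."""
    received = {entry["request_id"] for entry in callee_log}
    sent = dict.fromkeys(entry["request_id"] for entry in caller_log)  # dedupe, keep order
    return [rid for rid in sent if rid not in received]


# Hypothetical sample: the caller logs two writes, the callee side shows only one.
caller = [
    {"request_id": "req-1", "msg": "sending PutItem"},
    {"request_id": "req-2", "msg": "sending PutItem"},
]
callee = [{"request_id": "req-1", "msg": "PutItem received"}]

print(find_broken_requests(caller, callee))  # prints ['req-2']
```

Any ID this returns is a request that left the caller and never arrived, which narrows the search to what sits between the two services.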

7. Health checks and readiness probes

For load-balanced and container endpoints, health checks tell the platform which targets are safe to send traffic to. A health check periodically asks each target whether it's alive and serving, and the load balancer routes only to the targets that pass. A readiness probe asks whether a target is ready for traffic yet, which matters during startup. When a target fails its checks, the platform pulls it out of rotation, so users stop hitting the broken instance. This explains many "why did this instance stop getting traffic?" mysteries.
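To see why a failing target drops out of rotation, here is a small simulation of threshold-based health checking. The `Target` class, `record_check`, and the threshold values are assumptions for illustration, not an AWS API; the idea mirrors the healthy and unhealthy threshold counts a load balancer lets you configure, where a target leaves rotation after a few consecutive failures and returns only after a few consecutive passes.

```python
from dataclasses import dataclass

HEALTHY_THRESHOLD = 3    # assumed: consecutive passes needed to re-enter rotation
UNHEALTHY_THRESHOLD = 2  # assumed: consecutive failures needed to leave rotation


@dataclass
class Target:
    """One backend behind a load balancer (hypothetical model)."""
    name: str
    healthy: bool = True
    streak: int = 0  # consecutive results that disagree with the current state


def record_check(target: Target, passed: bool) -> None:
    """Update a target's state after one health check result."""
    if passed == target.healthy:
        target.streak = 0  # result agrees with current state; reset the counter
        return
    target.streak += 1
    needed = HEALTHY_THRESHOLD if passed else UNHEALTHY_THRESHOLD
    if target.streak >= needed:
        target.healthy = passed  # flip state once the threshold is reached
        target.streak = 0


def in_rotation(targets: list[Target]) -> list[str]:
    """Names of targets the load balancer would send traffic to."""
    return [t.name for t in targets if t.healthy]
```

Requiring consecutive results, rather than reacting to a single check, keeps one slow response from flapping a target in and out of rotation.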

8. A troubleshooting playbook

Let's tie it into a repeatable playbook. First, read the error and match it to a known signature so you have a hypothesis. Second, query with Logs Insights to scope the problem: how often, since when, which users. Third, if it's an integration, correlate both sides by request or trace ID to find where the request breaks. Fourth, change exactly one thing and verify. This ordered approach beats poking at settings, whether the culprit is a timeout, a permission, or a slow dependency.

9. Let's practice!

Let's go debug something. Your turn to practice!
