CloudWatch alarms and notifications

1. CloudWatch alarms and notifications

Welcome back! You've got metrics and logs flowing into CloudWatch. But what good is data if nobody's watching? In this video, we'll set up alarms that detect problems and wake up the right people automatically.

2. What are CloudWatch alarms?

At its core, a CloudWatch Alarm watches a single metric and performs actions when it crosses a threshold. It has six core components: the metric, threshold, the time interval per evaluation, evaluation periods, data points to alarm, and the actions on state change. Every alarm is always in one of three states.

3. Alarm states

The three states are OK, ALARM, and INSUFFICIENT_DATA. Alarms start in INSUFFICIENT_DATA and transition to OK or ALARM as data arrives. Now let's see how CloudWatch decides when to change state.

4. Alarm evaluation

Evaluation is a five-step process: collect data points, apply a statistic, compare to the threshold, count breaching periods, and change state if the criteria are met. Think of it like a strikes rule: with two bad check-ins out of the last three, it's time to act. The strategy you choose determines how sensitive the alarm is to short spikes.
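The five steps can be sketched in a few lines of Python. This is a simplified model for illustration, not CloudWatch's actual implementation: the function name `evaluate_alarm`, the sample CPU values, and the choice of the average as the statistic are all assumptions, and periods with no data are simply skipped here.

```python
from statistics import mean

def evaluate_alarm(periods, threshold, evaluation_periods, datapoints_to_alarm):
    """Return "ALARM", "OK", or "INSUFFICIENT_DATA" for the most recent periods.

    periods: list of lists, oldest first. Each inner list holds the raw
    samples collected in one period; an empty list means no data arrived.
    """
    recent = periods[-evaluation_periods:]        # 1. collect data points
    if all(len(samples) == 0 for samples in recent):
        return "INSUFFICIENT_DATA"
    breaching = 0
    for samples in recent:
        if not samples:
            continue                              # gap: skipped in this sketch
        value = mean(samples)                     # 2. apply a statistic
        if value > threshold:                     # 3. compare to threshold
            breaching += 1                        # 4. count breaching periods
    # 5. change state if the criteria are met
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"

# Hypothetical CPU samples for four periods, two samples per period.
cpu = [[40, 50], [85, 95], [60, 70], [90, 92]]
state = evaluate_alarm(cpu, threshold=80, evaluation_periods=3, datapoints_to_alarm=2)
print(state)
```

With these sample values, two of the last three periods average above 80, so the two-out-of-three rule is satisfied.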

5. Evaluation strategies

There are three evaluation strategies: consecutive (every period must breach, giving fewer false positives), partial (some periods can be OK, a balanced choice), and single breach (one breaching period triggers immediately, the highest sensitivity), which suits critical metrics that need instant action. Alarms have four action types. Let's look at those next.
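One way to picture the three strategies is as different M-out-of-N settings. In the sketch below, `EvaluationPeriods` and `DatapointsToAlarm` are the real alarm parameter names, while the strategy labels, the specific numbers, and the helper `would_alarm` are assumptions chosen for illustration.

```python
# N = EvaluationPeriods, M = DatapointsToAlarm.
# The strategy names are labels from this lesson, not API values.
STRATEGIES = {
    "consecutive":   {"EvaluationPeriods": 3, "DatapointsToAlarm": 3},
    "partial":       {"EvaluationPeriods": 3, "DatapointsToAlarm": 2},
    "single_breach": {"EvaluationPeriods": 1, "DatapointsToAlarm": 1},
}

def would_alarm(breaches, strategy):
    """breaches: list of booleans, oldest first, one entry per period."""
    cfg = STRATEGIES[strategy]
    window = breaches[-cfg["EvaluationPeriods"]:]
    return sum(window) >= cfg["DatapointsToAlarm"]

# A single spike in the most recent period.
spike = [False, False, True]
results = {name: would_alarm(spike, name) for name in STRATEGIES}
print(results)
```

Only the single-breach setting reacts to the lone spike, which is the sensitivity trade-off described above.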

6. Alarm action types and triggers

Actions can trigger on a transition into any of the three states: OK, ALARM, or INSUFFICIENT_DATA. Each state supports four action types: SNS notifications, Auto Scaling policies, EC2 actions (stop, terminate, reboot, or recover), and Systems Manager actions such as creating an OpsItem or an incident. Before creating an alarm, we need to decide how to handle gaps in the data.
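As a rough sketch, each state has its own list of action ARNs in the alarm definition. The keys `AlarmActions`, `OKActions`, and `InsufficientDataActions` are the parameter names used by put-metric-alarm; the helper name, account ID, region, and topic names below are placeholders assumed for illustration.

```python
def build_action_params(alarm_actions, ok_actions=(), insufficient_data_actions=()):
    """Build the action-related part of an alarm definition."""
    params = {"ActionsEnabled": True, "AlarmActions": list(alarm_actions)}
    if ok_actions:
        params["OKActions"] = list(ok_actions)
    if insufficient_data_actions:
        params["InsufficientDataActions"] = list(insufficient_data_actions)
    return params

# Placeholder ARNs: the account ID, region, and topic name are made up.
OPS_TOPIC = "arn:aws:sns:us-east-1:123456789012:ops-alerts"
EC2_RECOVER = "arn:aws:automate:us-east-1:ec2:recover"

actions = build_action_params(
    alarm_actions=[OPS_TOPIC, EC2_RECOVER],  # notify and attempt recovery on ALARM
    ok_actions=[OPS_TOPIC],                  # send an all-clear on return to OK
)
print(actions)
```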

7. Missing data behavior

Handling missing data badly leads to false positives or missed issues. It's like deciding what to do when a security guard doesn't check in: assume all's fine (notBreaching), sound the alarm (breaching), or keep the current status (ignore)? A fourth setting, missing, is the default and treats the gap as simply having no data. Let's create a standard alarm using the CLI.
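The effect of each setting can be sketched by classifying a series that contains a gap. The setting names are the real `TreatMissingData` values; the function `resolve_missing` and the sample numbers are assumptions, and the real evaluation logic is more involved than this per-period view.

```python
def resolve_missing(datapoints, threshold, treat_missing_data="missing"):
    """Classify each period as "breaching", "ok", or None (no verdict).

    datapoints: one value per period, oldest first; None marks a gap.
    """
    verdicts = []
    for value in datapoints:
        if value is not None:
            verdicts.append("breaching" if value > threshold else "ok")
        elif treat_missing_data == "breaching":
            verdicts.append("breaching")   # sound the alarm
        elif treat_missing_data == "notBreaching":
            verdicts.append("ok")          # assume all's fine
        else:
            # "ignore" and "missing": the gap contributes no verdict here
            verdicts.append(None)
    return verdicts

series = [90, None, 50]  # hypothetical values with one missed check-in
for setting in ("breaching", "notBreaching", "ignore", "missing"):
    print(setting, resolve_missing(series, threshold=80, treat_missing_data=setting))
```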

8. Creating a standard alarm: AWS CLI

We create an alarm via the CLI using put-metric-alarm, specifying the name, metric, namespace, and the threshold and evaluation settings. Here we configure it to send to an SNS topic. A standard alarm watches one metric, so to combine conditions we use composite alarms.
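For readers who script this instead of typing CLI flags, here is a minimal sketch of the same alarm as a parameter dictionary. The keys mirror the put-metric-alarm options; the helper name, instance ID, topic ARN, naming pattern, and threshold values are assumptions, not values from the video.

```python
def cpu_alarm_params(instance_id, topic_arn):
    """Parameters for a high-CPU alarm on one EC2 instance."""
    return {
        "AlarmName": f"prod-ec2-{instance_id}-high-cpu",
        "AlarmDescription": "Average CPU above 80% for 2 of 3 five-minute periods",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,                 # seconds per evaluation period
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 2,
        "Threshold": 80.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn],   # SNS topic to notify
    }

# Placeholder identifiers for illustration only.
params = cpu_alarm_params(
    "i-0123456789abcdef0",
    "arn:aws:sns:us-east-1:123456789012:ops-alerts",
)
print(params["AlarmName"])
```

With boto3 installed and credentials configured, a dictionary like this could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`.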

9. Composite alarms: what and why

Composite alarms reduce alert fatigue, like a circuit breaker panel that evaluates the overall state rather than letting each fuse trigger independently. You combine child alarms with AND, OR, and NOT. A complex expression such as (high error rate OR high latency) AND NOT maintenance mode suppresses notifications during planned work.

10. Creating a composite alarm: AWS CLI

We create composite alarms with put-composite-alarm. The key parameter is alarm-rule, an expression referencing child alarms wrapped in the ALARM function. The complex example combines OR and AND with NOT. Now, how do we choose the right threshold?
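The rule expression is just a string, so it can be assembled with small helpers. In this sketch the `ALARM("...")` function and the AND, OR, NOT operators are the rule syntax described above; the helper functions and the three child alarm names are hypothetical.

```python
def alarm(name):
    """Reference a child alarm that is in the ALARM state."""
    return f'ALARM("{name}")'

def any_of(*terms):
    return "(" + " OR ".join(terms) + ")"

def all_of(*terms):
    return "(" + " AND ".join(terms) + ")"

def negate(term):
    return f"NOT {term}"

# (high error rate OR high latency) AND NOT maintenance mode
rule = all_of(
    any_of(alarm("high-error-rate"), alarm("high-latency")),
    negate(alarm("maintenance-mode")),
)

composite_params = {
    "AlarmName": "prod-api-degraded",   # hypothetical name
    "AlarmRule": rule,
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder
}
print(rule)
```

Building the rule from helpers keeps the parentheses balanced, which matters once expressions mix OR and AND.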

11. Threshold selection strategy

There are four threshold strategies: baseline deviation using averages and standard deviation, capacity-based hard limits, SLA-aligned targets (your service-level agreement commitments), and rate of change to catch emerging trends. Setting several thresholds on the same metric enables a multi-tier approach that routes different severities to different teams.
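Three of these strategies come down to simple arithmetic, sketched below; the SLA-aligned case is a constant taken from your agreement. The function names, the multiplier of three standard deviations, the 80% capacity fraction, and the sample latencies are all assumptions for illustration.

```python
from statistics import mean, stdev

def baseline_threshold(history, k=3.0):
    """Baseline deviation: mean plus k sample standard deviations."""
    return mean(history) + k * stdev(history)

def capacity_threshold(hard_limit, fraction=0.8):
    """Capacity-based: alarm at a fixed fraction of a hard limit."""
    return hard_limit * fraction

def rate_of_change(previous, current):
    """Relative change between two consecutive readings."""
    return (current - previous) / previous

# SLA-aligned: a hypothetical commitment of 500 ms latency.
SLA_LATENCY_MS = 500

latency_history = [98, 102, 100, 100]   # made-up latency samples in ms
print(baseline_threshold(latency_history))
print(capacity_threshold(1000))          # e.g. 1000 concurrent connections
print(rate_of_change(200, 250))          # growth between two readings
```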

12. Multi-tier alarm strategy

A three-tier graduated response is a good place to start. Warnings go to email or Slack, critical alarms page the on-call engineer, and emergencies page multiple teams, with each tier on a separate SNS topic. For metrics without predictable thresholds, CloudWatch can learn what normal looks like automatically.
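A minimal sketch of the three tiers as data: one alarm per severity on the same metric, each with its own threshold and SNS topic. The threshold values, topic ARNs, naming pattern, and helper name are assumptions, not recommendations from the video.

```python
# (severity, threshold, SNS topic ARN); all values are placeholders.
TIERS = [
    ("warning",   70.0, "arn:aws:sns:us-east-1:123456789012:team-email"),
    ("critical",  85.0, "arn:aws:sns:us-east-1:123456789012:oncall-page"),
    ("emergency", 95.0, "arn:aws:sns:us-east-1:123456789012:all-teams-page"),
]

def tiered_alarm_params(base_name, namespace, metric_name):
    """Return one alarm definition per tier for the same metric."""
    return [
        {
            "AlarmName": f"{base_name}-{severity}",
            "Namespace": namespace,
            "MetricName": metric_name,
            "Statistic": "Average",
            "Period": 300,
            "EvaluationPeriods": 3,
            "DatapointsToAlarm": 2,
            "Threshold": threshold,
            "ComparisonOperator": "GreaterThanThreshold",
            "AlarmActions": [topic_arn],   # each tier has its own topic
        }
        for severity, threshold, topic_arn in TIERS
    ]

alarms = tiered_alarm_params("prod-web-cpu", "AWS/EC2", "CPUUtilization")
for a in alarms:
    print(a["AlarmName"], a["Threshold"], a["AlarmActions"][0])
```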

13. Anomaly detection alarms

Anomaly detection uses ML to build a dynamic threshold band around normal behavior, via the ANOMALY_DETECTION_BAND expression, whose width is a configurable number of standard deviations. It adapts to daily and weekly patterns automatically, so you don't have to tune a static threshold by hand. Let's apply this to the AWS services you'll alarm on in production.
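An anomaly detection alarm replaces the fixed threshold with a reference to the band expression. The sketch below shows roughly how such a definition could look as put-metric-alarm parameters; the function name, the Lambda function name, the topic ARN, the metric IDs `m1` and `band`, and the default width of 2 are assumptions.

```python
def anomaly_alarm_params(function_name, topic_arn, band_width=2):
    """Alarm when the metric rises above the upper edge of the expected band."""
    return {
        "AlarmName": f"prod-lambda-{function_name}-duration-anomaly",
        "ComparisonOperator": "GreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 2,
        "ThresholdMetricId": "band",     # points at the band expression below
        "TreatMissingData": "notBreaching",
        "Metrics": [
            {
                "Id": "m1",              # the metric being watched
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/Lambda",
                        "MetricName": "Duration",
                        "Dimensions": [
                            {"Name": "FunctionName", "Value": function_name}
                        ],
                    },
                    "Period": 300,
                    "Stat": "Average",
                },
            },
            {
                "Id": "band",            # band width in standard deviations
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": "Expected range",
                "ReturnData": True,
            },
        ],
        "AlarmActions": [topic_arn],
    }

# Placeholder function name and topic ARN.
params = anomaly_alarm_params(
    "checkout-handler", "arn:aws:sns:us-east-1:123456789012:ops-alerts"
)
print(params["Metrics"][1]["Expression"])
```

A wider band (a larger second argument) tolerates more variation before the alarm fires.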

14. Resource alarms: Lambda and ALB

For Lambda, watch error count, throttles, and high duration approaching the timeout limit. For ALB, the Application Load Balancer, watch target response times, unhealthy hosts, and 5xx error counts. Finally, practices to keep alarms well-managed as your architecture grows.
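As a starting point, these six checks can be written down as a small table of metric names and thresholds. The namespaces and metric names are the ones CloudWatch publishes for Lambda and ALB; the helper name, the 30-second timeout, the 80% margin, and every threshold value are assumptions to adapt to your own workload.

```python
def resource_alarm_specs(lambda_timeout_ms=30_000):
    """Suggested metric checks for a Lambda function and an ALB (values are examples)."""
    return [
        # Lambda: errors, throttles, and duration approaching the timeout
        {"Namespace": "AWS/Lambda", "MetricName": "Errors",
         "Statistic": "Sum", "Threshold": 5},
        {"Namespace": "AWS/Lambda", "MetricName": "Throttles",
         "Statistic": "Sum", "Threshold": 0},
        {"Namespace": "AWS/Lambda", "MetricName": "Duration",       # milliseconds
         "Statistic": "Maximum", "Threshold": 0.8 * lambda_timeout_ms},
        # ALB: response time, unhealthy hosts, and 5xx errors from targets
        {"Namespace": "AWS/ApplicationELB", "MetricName": "TargetResponseTime",  # seconds
         "Statistic": "Average", "Threshold": 1.0},
        {"Namespace": "AWS/ApplicationELB", "MetricName": "UnHealthyHostCount",
         "Statistic": "Maximum", "Threshold": 0},
        {"Namespace": "AWS/ApplicationELB", "MetricName": "HTTPCode_Target_5XX_Count",
         "Statistic": "Sum", "Threshold": 10},
    ]

specs = resource_alarm_specs()
for spec in specs:
    # Every check here alarms when the statistic exceeds the threshold.
    spec["ComparisonOperator"] = "GreaterThanThreshold"
    print(spec["Namespace"], spec["MetricName"], spec["Threshold"])
```

Each entry still needs dimensions (for example `FunctionName` for Lambda), a period, and actions before it is a complete alarm definition.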

15. Alarm management recommended practices

At scale, use consistent naming, descriptions with runbook links, tags for environment and team, and monthly reviews. Always test with set-alarm-state to confirm notifications reach the right people.
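These practices can be partly enforced in code. The sketch below shows one possible naming convention, a tag list, and the parameters for a set-alarm-state test; the convention itself, the helper names, the runbook URL, and the tag values are hypothetical.

```python
def alarm_name(environment, service, metric, severity):
    """One possible convention: environment-service-metric-severity."""
    return "-".join([environment, service, metric, severity]).lower()

def alarm_metadata(environment, team, runbook_url):
    """Description and tags to attach when the alarm is created."""
    return {
        "AlarmDescription": f"Runbook: {runbook_url}",
        "Tags": [
            {"Key": "environment", "Value": environment},
            {"Key": "team", "Value": team},
        ],
    }

def notification_test_params(name):
    """Parameters for set-alarm-state, which forces a temporary state change."""
    return {
        "AlarmName": name,
        "StateValue": "ALARM",
        "StateReason": "Manual test of the notification path",
    }

name = alarm_name("prod", "checkout", "latency", "critical")
metadata = alarm_metadata("prod", "payments", "https://example.com/runbooks/checkout-latency")
test_params = notification_test_params(name)
print(name)
print(test_params)
```

The forced state only lasts until the next evaluation, so the alarm returns to its real state on its own.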

16. Video summary

To recap: three alarm states, three evaluation strategies, composite alarms for noise reduction, four threshold approaches, multi-tier alerting for graduated response, and practical alarms across common AWS services. Next, connecting alarms to notifications with SNS and SQS.

17. Let's practice!

Now let's see how alarms and notifications work.