Logs Insights and analysis

1. Logs Insights and analysis

Welcome back! Your alarm fired and your team got notified. Now what? You need to dig into your logs and find the root cause, fast. In this video, we'll master CloudWatch Logs Insights queries for real troubleshooting scenarios.

2. Logs Insights query structure

Logs Insights provides a purpose-built query language for log analysis across one or more log groups. Five pipe-chained commands cover most tasks: fields, filter, stats, sort, and limit. But how do we query the data?
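
As a rough sketch, the five commands chain together as shown here. The query sits in a Python string so it can be pasted into the console or handed to an API client. The level field and its ERROR value are assumptions about the log format; @timestamp, @message, and @logStream are fields that Logs Insights generates itself.

```python
# A minimal query that uses all five core commands, one per line.
# Assumes each log event carries a `level` field (hypothetical log format).
BASIC_QUERY = """
fields @timestamp, @message, level
| filter level = "ERROR"
| stats count(*) as error_count by @logStream
| sort error_count desc
| limit 20
""".strip()

# Every stage after the first begins with a pipe; pull out the command names.
commands = [line.lstrip("| ").split()[0] for line in BASIC_QUERY.splitlines()]
print(commands)
```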

3. Query languages: same query, three ways

Logs Insights supports three query languages, all returning the same results. On the left, the native Query Language: pipe-delimited, with the richest log-specific features like parse and bin. On the right, OpenSearch Piped Processing Language, also pipe-based but following OpenSearch conventions. A third option, SQL, is next.

4. Query languages: OpenSearch SQL

The third option is OpenSearch SQL: the standard SELECT, FROM, and WHERE syntax many teams know, plus JOINs across log groups. All three express the same query, so pick whichever fits your team. Let's put that to work with our first scenario.

5. Scenario 1: Finding error spikes

First scenario: error spikes. Filter on level = ERROR and group the counts into five-minute bins to see when the spike began and how severe it is. Add error_type to the grouping to tell a single dominant error from several types spiking at once; the latter suggests a broader infrastructure problem. What about performance? Let's tackle slow endpoints.
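
The two queries for this scenario might look like the following sketch. The level and error_type field names come from the scenario; whether your logs expose them under those names is an assumption.

```python
# Error count per five-minute bin: when did the spike start, and how big is it?
ERROR_SPIKES = """
filter level = "ERROR"
| stats count(*) as error_count by bin(5m)
""".strip()

# Same count, split by error_type: one dominant error or several at once?
ERROR_SPIKES_BY_TYPE = """
filter level = "ERROR"
| stats count(*) as error_count by bin(5m), error_type
""".strip()

print(ERROR_SPIKES_BY_TYPE)
```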

6. Scenario 2: Slow API endpoints

Filter on response_time > 1000 and aggregate average, max, and count by endpoint. Then drill down with request_id and user_id for the 20 slowest requests to see whether slowness affects all users or a subset. Errors and performance often point to security. Let's look at failed authentication.
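
A possible shape for the two queries, assuming response_time is logged in milliseconds and the route is stored in a field called endpoint (both are assumptions about the log format):

```python
# Aggregate view: which endpoints are slow, and how often?
SLOW_ENDPOINTS = """
filter response_time > 1000
| stats avg(response_time) as avg_ms, max(response_time) as max_ms, count(*) as slow_requests by endpoint
| sort avg_ms desc
""".strip()

# Drill-down: the 20 slowest individual requests, with who made them.
SLOWEST_REQUESTS = """
fields @timestamp, endpoint, request_id, user_id, response_time
| filter response_time > 1000
| sort response_time desc
| limit 20
""".strip()

print(SLOW_ENDPOINTS)
print(SLOWEST_REQUESTS)
```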

7. Scenario 3: Failed authentication

Filter on login_failed and aggregate by user_id and ip_address; a high count from a single IP suggests a brute-force attack. The time-based version bins by hour and filters for IPs exceeding ten attempts, surfacing concentrated attack windows. Database and memory issues have their own patterns. Two more scenarios.
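
One way to write both versions is sketched here. It matches the text login_failed anywhere in the message, which is an assumption; if your logs carry it as a structured field, filter on that field instead.

```python
# Who is failing to log in, and from where?
FAILED_LOGINS = """
filter @message like "login_failed"
| stats count(*) as attempts by user_id, ip_address
| sort attempts desc
""".strip()

# Hourly view: keep only IPs with more than ten failures in a given hour.
# The second filter runs on the aggregated count, not on raw events.
ATTACK_WINDOWS = """
filter @message like "login_failed"
| stats count(*) as attempts by ip_address, bin(1h)
| filter attempts > 10
| sort attempts desc
""".strip()

print(ATTACK_WINDOWS)
```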

8. Scenarios 4 & 5: Database timeouts and memory leaks

For database timeouts, use parse to extract the timeout duration, then aggregate count and average over five-minute bins to track onset and severity. For memory leaks, the bin function is like choosing a zoom level on a timeline: bin(1h) gives hourly granularity. Aggregate memory_used_mb per bin and read the results in ascending time order; steady growth without GC recovery confirms a leak. Finally, tracing a request end-to-end.
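
For the database-timeout half, a sketch could look like this. The message format "timeout after <n> ms" is invented for illustration; adjust the parse pattern to whatever your database client actually logs.

```python
# Assumes log lines such as: "DB query timeout after 5000 ms" (hypothetical format).
DB_TIMEOUTS = """
filter @message like "timeout"
| parse @message "timeout after * ms" as timeout_ms
| stats count(*) as timeouts, avg(timeout_ms) as avg_timeout_ms by bin(5m)
""".strip()

print(DB_TIMEOUTS)
```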

9. Scenario 6: Request tracing

Filter on a specific request_id and sort by timestamp ascending to trace the complete journey. Add service, action, duration_ms, and status to see which service handled each step, how long it took, and where it failed. Beyond basic queries, the Query Language handles unstructured data too.
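
A small helper can build the trace query for any request. The function name trace_query and the sample ID req-123 are hypothetical; the field names are the ones listed in the scenario.

```python
def trace_query(request_id: str) -> str:
    """Build a query that lists every log event for one request, oldest first."""
    # Reject quotes so the ID cannot break out of the string literal.
    if '"' in request_id:
        raise ValueError("request_id must not contain double quotes")
    return (
        "fields @timestamp, service, action, duration_ms, status\n"
        f'| filter request_id = "{request_id}"\n'
        "| sort @timestamp asc"
    )


print(trace_query("req-123"))
```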

10. Advanced techniques: parse, Regex

The parse command extracts fields from unstructured text using wildcard patterns. You can also use regex with named capture groups, essential for legacy apps that write free-text log lines. Calculated fields and conditional logic extend this without changing your logs.
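
Here is a sketch of both forms against an invented free-text line. The log format is hypothetical. The final lines check the same pattern locally with Python's re module, which spells named groups (?P<name>...) where Logs Insights uses (?<name>...).

```python
import re

# Wildcard form: each * captures the text at that position, in order.
PARSE_GLOB = 'parse @message "user=* action=* duration=*ms" as user, action, duration_ms'

# Regex form: named capture groups become field names.
PARSE_REGEX = (
    r"parse @message "
    r"/user=(?<user>\S+) action=(?<action>\S+) duration=(?<duration_ms>[0-9]+)ms/"
)

# Local check of the same pattern on a sample line (hypothetical log format).
line = "user=alice action=checkout duration=1520ms"
pattern = re.compile(
    r"user=(?P<user>\S+) action=(?P<action>\S+) duration=(?P<duration_ms>[0-9]+)ms"
)
extracted = pattern.search(line).groupdict()
print(extracted)
```

Captured values arrive as text, so duration_ms is the string "1520" in the Python check.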

11. Advanced techniques: Calculated Fields

Calculated fields derive values inline, like error rate as a percentage. The case function categorizes values into buckets like success, client error, and server error. And dot notation accesses nested JSON fields like user.email, all without changing application code. Fast queries matter when you're troubleshooting under pressure. Four rules to remember.
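
To make the logic concrete, here is the same idea in plain Python rather than Logs Insights syntax: bucket a status code, compute an error rate as a percentage, and read a nested field. The sample events, the status field, and the choice to count both client and server errors toward the rate are all assumptions for illustration.

```python
def status_bucket(status: int) -> str:
    """Categorize an HTTP status code the way a case expression would."""
    if status >= 500:
        return "server error"
    if status >= 400:
        return "client error"
    return "success"


# Hypothetical structured log events with a nested user object.
events = [
    {"status": 200, "user": {"email": "a@example.com"}},
    {"status": 404, "user": {"email": "b@example.com"}},
    {"status": 500, "user": {"email": "c@example.com"}},
    {"status": 200, "user": {"email": "d@example.com"}},
]

buckets = [status_bucket(e["status"]) for e in events]

# Calculated field: errors as a percentage of all requests.
error_count = sum(1 for b in buckets if b != "success")
error_rate = error_count / len(events) * 100

# Equivalent of dot notation (user.email) on nested JSON.
emails = [e["user"]["email"] for e in events]

print(buckets, error_rate, emails[0])
```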

12. Query optimization

To prevent queries from timing out, do four things: always set a time range, filter before aggregating, limit raw results, and prefer aggregations over sorting raw records. These are the difference between a two-second result and a timeout. The final technique combines correlation and statistical anomaly detection.
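
As a sketch of the first three rules, the helper here always sends an explicit time range and a row cap, and the query filters before it aggregates. It expects a boto3 CloudWatch Logs client to be passed in; the helper name, the default window, and the log group name in the usage note are hypothetical.

```python
import time


def start_bounded_query(logs_client, log_group: str, query: str,
                        minutes: int = 60, max_rows: int = 100) -> str:
    """Start a query with an explicit time range and result cap; return its ID."""
    end = int(time.time())
    start = end - minutes * 60
    response = logs_client.start_query(
        logGroupName=log_group,
        startTime=start,      # epoch seconds
        endTime=end,
        queryString=query,
        limit=max_rows,       # cap on returned log events
    )
    return response["queryId"]


# Filter first, then aggregate, so stats only sees matching events.
QUERY = """
filter level = "ERROR"
| stats count(*) as error_count by bin(5m)
""".strip()
```

With boto3 installed, usage would be along the lines of start_bounded_query(boto3.client("logs"), "/app/demo", QUERY, minutes=15), followed by polling get_query_results with the returned ID.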

13. Anomaly detection and correlation

For anomaly detection, calculate a baseline average and standard deviation over seven days, then compare the current period against it: anything beyond two standard deviations is flagged. This catches subtle degradations that fixed thresholds miss. You can also calculate error rate over five-minute windows. A constant rate as traffic grows suggests a code bug; a rate that spikes only at peak suggests a resource limit, such as an exhausted connection pool, thread pool, or rate limit.
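
The two-standard-deviation check itself is simple arithmetic. Here is a minimal sketch in plain Python, assuming the seven daily baseline values have already been exported from a query; the sample numbers and the function name is_anomaly are made up.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomaly(current: float, baseline: Sequence[float], sigmas: float = 2.0) -> bool:
    """Flag a value that lies more than `sigmas` standard deviations from the baseline mean."""
    mu = mean(baseline)
    sd = stdev(baseline)  # sample standard deviation
    return abs(current - mu) > sigmas * sd


# Hypothetical seven-day baseline of average latency in milliseconds.
baseline_ms = [100, 102, 98, 101, 99, 100, 100]

print(is_anomaly(103, baseline_ms))  # 3 ms from the mean, threshold is about 2.58 ms
print(is_anomaly(101, baseline_ms))  # 1 ms from the mean
```

A fixed threshold of, say, 150 ms would miss the first value entirely, which is the point of comparing against the baseline's own spread.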

14. Summary

To recap: Logs Insights is an interactive query service with no infrastructure to manage and three language options. We walked through six troubleshooting scenarios covering errors, performance, security, and distributed tracing. Advanced features like parse, bin, and pct cover unstructured text, time bucketing, and percentiles. And once you've found a pattern manually, automate its detection with metric filters and alarms, closing the loop from investigation to proactive monitoring.

15. Let's practice!

Let's look at what we learned in this video.
