Analyzing traces and service maps

1. Analyzing traces and service maps

Last lesson we instrumented code and started generating trace data. Now we make sense of it, with two core skills: reading service maps to understand your architecture, and using trace timelines to diagnose performance.

2. What is a service map?

A service map is generated automatically from trace data. Think of it like the live departure board at a railway station: at a glance you see which routes are on time, delayed, or stopped. Access it from the X-Ray console under Service map, apply filters, then click a node for a detailed breakdown.

3. Service map color coding

Service maps are color coded. Green means success. Yellow flags client errors, the four hundreds. Red flags server faults, the five hundreds. Purple is throttling. Gray means no recent traffic. Each node is shaded by the ratio of errors and faults to successful calls. Go straight to anything that isn't green, then click it for its response time distribution and error breakdown.
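To make the color rules concrete, here is a minimal sketch of the per-call classification behind them. It assumes status code 429 is what counts as throttling, and the function name node_color is ours for illustration, not part of any X-Ray API.

```python
def node_color(status_code=None):
    """Map one call's HTTP status code to the service map color described above."""
    if status_code is None:
        return "gray"      # no recent traffic to classify
    if status_code == 429:
        return "purple"    # throttling (assumed to mean 429 Too Many Requests)
    if 500 <= status_code <= 599:
        return "red"       # server fault, the five hundreds
    if 400 <= status_code <= 499:
        return "yellow"    # client error, the four hundreds
    return "green"         # success
```

The 429 check has to come before the general four-hundreds check, otherwise throttled calls would be counted as ordinary client errors.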

4. Reading a service map

Here, the API Gateway responds in fifty milliseconds. The OrderService fans out three ways: DynamoDB and Inventory are fast, but Payment takes a hundred milliseconds and its external API two hundred, so that branch is where to look.

5. Critical path analysis

Finding the critical path is like a road trip hitting roadworks: that one stretch determines when you arrive, however smooth the rest is. Here, the external API is fifty percent of the four hundred milliseconds, so it's your optimization target. Options: add caching, implement a timeout with a fallback, or negotiate better SLAs.
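One of those options, a timeout with a fallback, could look roughly like this. It is a sketch using only the standard library; call_with_fallback and the pool size are illustrative choices, not something from the lesson's codebase.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool for outbound calls (size chosen arbitrarily for the sketch).
_pool = ThreadPoolExecutor(max_workers=4)

def call_with_fallback(fn, timeout_s, fallback):
    """Run fn in a worker thread; return fallback if it takes longer than timeout_s."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The slow call keeps running in the background; we simply stop waiting for it.
        return fallback
```

Note the trade-off: the caller gets a bounded response time, but the abandoned call still occupies a worker until it finishes, so the pool needs headroom.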

6. Dependency risks

Watch for three dependency risks. Circular dependencies show as bidirectional arrows. Single points of failure appear as one heavily-connected node with no redundancy. External dependencies in yellow or red are outside your control.
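If we export the map's edges as caller and callee pairs, the first two risks can be checked with a few lines. This is only a sketch: the edge-list format and the function names are assumptions, and it catches only two-service loops, not longer cycles.

```python
from collections import Counter

def mutual_edges(edges):
    """Return sorted pairs of services that call each other (bidirectional arrows)."""
    edge_set = set(edges)
    pairs = {
        tuple(sorted((caller, callee)))
        for caller, callee in edge_set
        if caller != callee and (callee, caller) in edge_set
    }
    return sorted(pairs)

def fan_in(edges):
    """Count how many distinct callers depend on each service."""
    return Counter(callee for _, callee in set(edges))
```

A service with a high fan-in count and no replica is a candidate single point of failure; longer circular chains would need a proper graph traversal.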

7. Understanding trace timelines

When reading a trace, a few points matter. Segments and subsegments sit along a horizontal time axis, showing when each operation started and how long it ran. The vertical axis is the service hierarchy: subsegments indented under the parent that called them. Here the HTTP POST to payment-api is the longest operation, so focus there first.
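The same reading can be done on the raw data. The sketch below assumes a segment shaped like a nested dictionary with name, start_time, end_time, and subsegments fields; the helper names are ours.

```python
def leaves(segment):
    """Yield the operations that have no subsegments of their own."""
    children = segment.get("subsegments", [])
    if not children:
        yield segment
    for child in children:
        yield from leaves(child)

def duration(segment):
    """Elapsed time of a segment, in the same units as its timestamps."""
    return segment["end_time"] - segment["start_time"]

def longest_leaf(segment):
    """Return the leaf operation that ran the longest."""
    return max(leaves(segment), key=duration)
```

We compare leaves rather than every segment because a parent always spans its children, so the root would otherwise win every time.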

8. Pattern 1: Sequential bottlenecks

Trace timelines make sequential bottlenecks easy to spot. Think of shopping: instead of making one supermarket trip per item, you grab everything in one trip, which is the equivalent of running operations in parallel. In the timeline, sequential operations stack end-to-end instead of overlapping, so the waste is immediately visible.
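Here is a small sketch of the difference, using asyncio with a sleep standing in for a downstream call. The function names and the fifty-millisecond delay are made up for illustration, and it only applies when the operations do not depend on each other.

```python
import asyncio
import time

async def fetch(item, delay=0.05):
    """Stand-in for an I/O call such as a downstream request."""
    await asyncio.sleep(delay)
    return item

async def run_sequential(items):
    # Each call waits for the previous one: durations add up.
    return [await fetch(item) for item in items]

async def run_parallel(items):
    # All calls are in flight together: total is roughly the slowest one.
    return list(await asyncio.gather(*(fetch(item) for item in items)))

def timed(coro_fn, items):
    """Run a coroutine function and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(items))
    return result, time.perf_counter() - start
```

In a trace, the first version shows as a staircase of subsegments and the second as a stack of overlapping bars.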

9. Pattern 2: N+1 queries and chatty services

Next is the N+1 pattern. On the left, the code fetches a user list and then runs one query per user instead of fetching everything at once; fix it by batching the queries or using eager loading. The same waste shows up between services, on the right, as chatty services: fifty small calls where one batched call would do. Each call pays for connection setup, serialization, and a round-trip, so that's fifty times the overhead. The fix is similar: batch the requests, add a message queue, or cache results.
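To see N+1 next to its batched fix, here is a sketch against an in-memory SQLite database. The users and orders tables and their rows are invented for the example.

```python
import sqlite3

def make_db():
    """Create a tiny in-memory database with hypothetical users and orders."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ana"), (2, "ben"), (3, "cy")])
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                     [(1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5)])
    return conn

def totals_n_plus_one(conn):
    """One query for the user list, then one more query per user."""
    user_ids = [row[0] for row in conn.execute("SELECT id FROM users ORDER BY id")]
    queries = 1
    totals = {}
    for uid in user_ids:
        row = conn.execute(
            "SELECT COALESCE(SUM(total), 0) FROM orders WHERE user_id = ?", (uid,)
        ).fetchone()
        totals[uid] = row[0]
        queries += 1
    return totals, queries

def totals_batched(conn):
    """The same answer from a single joined, grouped query."""
    rows = conn.execute(
        "SELECT u.id, COALESCE(SUM(o.total), 0) FROM users u "
        "LEFT JOIN orders o ON o.user_id = u.id GROUP BY u.id ORDER BY u.id"
    )
    return {uid: total for uid, total in rows}, 1
```

With three users the first version issues four queries and the second issues one; in a trace, that is four database subsegments against one.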

10. Pattern 3: Cold starts and cascading failures

Cold starts show as a large initialization segment, here two and a half seconds, before the handler runs; warm invocations skip it. Cascading failures are like a power cut in one flat tripping the building's fuses: one service times out, then the one above it, and so on up the chain. Circuit breakers are the fuse box that stops the cascade.
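A circuit breaker can be sketched in a few lines. This is a simplified illustration, not a production library: the class name, thresholds, and the single-trial half-open behavior are all assumptions.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without trying it."""

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

While the circuit is open, callers fail immediately instead of waiting on a timeout, which is what keeps the failure from climbing up the chain.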

11. Filtering traces with annotations

Annotations let us search traces efficiently. Use a simple filter like user_id equals User-123, or combine criteria with AND and OR. The real power is comparing groups: filter premium against free users and you might find premium features adding latency, two hundred milliseconds versus one hundred and fifty. Filter by region and you might see EU West three times slower than US East, a cross-region data-access problem.
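As a sketch, filter expressions like these can be assembled as plain strings. This assumes the annotation.KEY form of X-Ray filter expressions; the tier and region keys and the helper names are hypothetical.

```python
def annotation_equals(key, value):
    """Build one clause, for example: annotation.user_id = "User-123"."""
    return f'annotation.{key} = "{value}"'

def all_of(*clauses):
    """Combine clauses so that every one must match."""
    return " AND ".join(clauses)

def any_of(*clauses):
    """Combine clauses so that at least one must match."""
    return " OR ".join(clauses)

# Hypothetical comparison groups: run the same analysis once per expression.
premium = annotation_equals("tier", "premium")
free = annotation_equals("tier", "free")
```

Running the same time window through each group's expression and comparing the response times is what surfaces differences like the ones above.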

12. Trace analysis workflow

Eight steps guide a trace review. Define a filter to identify the issue, then analyze the data by ordering it. Check annotations and metadata for context. That sets up root cause analysis, before fixing and verifying.

13. Creating alerts from X-Ray data

Once we've identified the failure criteria, we codify them into a CloudWatch alarm that notifies us if the problem recurs. Here, one alarm watches for average response times over 1000 milliseconds; another watches for error counts above a set threshold.
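The first alarm might be described with parameters like these. Treat it as a sketch: the namespace and metric name are placeholders for whatever metric we actually publish, and the period and evaluation count are assumptions; only the 1000 millisecond threshold comes from the example above.

```python
def response_time_alarm(namespace, metric_name, threshold_ms=1000.0):
    """Build alarm parameters for 'average response time over threshold_ms'."""
    return {
        "AlarmName": f"{metric_name}-avg-over-{int(threshold_ms)}ms",
        "Namespace": namespace,          # placeholder: where the metric lives
        "MetricName": metric_name,       # placeholder: the published metric
        "Statistic": "Average",
        "Period": 60,                    # assumed: evaluate one-minute windows
        "EvaluationPeriods": 3,          # assumed: three breaching windows in a row
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanThreshold",
    }

# Hypothetical names, for illustration only.
params = response_time_alarm("MyApp/Tracing", "ResponseTimeMs")
```

With boto3, a dictionary shaped like this could be passed to the CloudWatch client as put_metric_alarm(**params), usually with an AlarmActions entry pointing at a notification topic.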

14. Video summary

To recap: service maps give an always-current view with color-coded health, and critical path shows where to focus. Trace timelines reveal sequential bottlenecks, N+1 queries, cold starts, and cascading failures. Annotations add business context, and the eight-step workflow takes you from identification to verified resolution. Next, we build unified health dashboards.

15. Let's practice!

Let's take a closer look at analyzing traces and service maps.
