Monitoring

1. Monitoring

Now that you understand Google Cloud Observability from a high-level perspective, let's look at Cloud Monitoring. Monitoring is important to Google because it is at the base of site reliability engineering, a discipline that applies aspects of software engineering to operations whose goals are to create ultra-scalable and highly reliable software systems. This discipline has enabled Google to build, deploy, monitor, and maintain some of the largest software systems in the world. Cloud Monitoring dynamically configures monitoring after resources are deployed and has intelligent defaults that allow you to easily create charts for basic monitoring activities. This allows you to monitor your platform, system, and application metrics by ingesting data, such as metrics, events, and metadata. You can then generate insights from this data through dashboards, charts, and alerts. For example, you can configure and measure uptime and health checks that send alerts via email. A metrics scope is the root entity that holds monitoring and configuration information in Cloud Monitoring. Each metrics scope can have between 1 and 375 monitored projects. Monitoring data for all projects in that scope will be visible. A metrics scope contains the custom dashboards, alerting policies, uptime checks, notification channels, and group definitions that you use with your monitored projects. A metrics scope can access metric data from its monitored projects, but the metrics data and log entries remain in the individual projects. The first monitored Google Cloud project in a metrics scope is called the scoping project, and it must be specified when you create the metrics scope. The name of that project becomes the name of your metrics scope. To access an AWS account, you must configure a project in Google Cloud to hold the AWS Connector. Because metrics scopes can monitor all your Google Cloud projects in a single place, a metrics scope is a "single pane of glass" through which you can view resources from multiple Google Cloud projects and AWS accounts. All users of Google Cloud Observability with access to that metrics scope have access to all data by default. This means that a role assigned to one person on one project applies equally to all projects monitored by that metrics scope. In order to give people different roles per project and to control visibility to data, consider placing the monitoring of those projects in separate metrics scopes. Cloud Monitoring allows you to create custom dashboards that contain charts of the metrics that you want to monitor. For example, you can create charts that display your instances' CPU utilization, the packets or bytes sent and received by those instances, and the packets or bytes dropped by the firewall of those instances. In other words, charts provide visibility into the utilization and network traffic of your VM instances, as shown on this slide. These charts can be customized with filters to remove noise, groups to reduce the number of time series, and aggregates to group multiple time series together. For a full list of supported metrics, please refer to the documentation. Although charts are extremely useful, they can only provide insight while someone is looking at them. But what if your server goes down in the middle of the night or over the weekend? Do you expect someone to always look at dashboards to determine whether your servers are available or have enough capacity or bandwidth? If not, you want to create alerting policies that notify you when specific conditions are met. For example, as shown on this slide, you can create an alerting policy when the network egress of your VM instance goes above a certain threshold for a specific timeframe. When this condition is met, you or someone else can be automatically notified through email, SMS, or other channels in order to troubleshoot this issue. You can also create an alerting policy that monitors your usage of Google Cloud Observability and alerts you when you approach the threshold for billing. For more information about this, please refer to the documentation. Here is an example of what creating an alerting policy looks like. On the left, you can see an HTTP check condition on the summer01 instance. This will send an email that is customized with the content of the documentation section on the right. Let's discuss some best practices when creating alerts: We recommend alerting on symptoms, and not necessarily causes. For example, you want to monitor failing queries of a database and then identify whether the database is down. Next, make sure that you are using multiple notification channels, like email and SMS. This helps avoid a single point of failure in your alerting strategy. We also recommend customizing your alerts to the audience's needs by describing what actions need to be taken or what resources need to be examined. And finally, avoid noise, because this will cause alerts to be dismissed over time. Specifically, adjust monitoring alerts so that they are actionable and don't just set up alerts on everything possible. Uptime checks can be configured to test the availability of your public services from locations around the world, as you can see on this slide. The type of uptime check can be set to HTTP, HTTPS, or TCP. The resource to be checked can be an App Engine application, a Compute Engine instance, a URL of a host, or an AWS instance or load balancer. For each uptime check, you can create an alerting policy and view the latency of each global location. Here is an example of an HTTP uptime check. The resource is checked every minute with a 10-second timeout. Uptime checks that do not get a response within this timeout period are considered failures. So far there is a 100% uptime with no outages. Monitoring data can originate at a number of different sources. With Google Compute Engine instances, because the VMs are running on Google hardware, the hypervisor cannot access some of the internal metrics inside a VM, for example, memory usage. The Ops Agent collects metrics inside the VM, not at the hypervisor level. The Ops Agent is the primary agent for collecting telemetry data from your Compute Engine instances. This diagram shows how data is collected to monitor workloads running on a Compute Engine instance. Ops Agent installed on Compute Engine collects data beyond the system metrics. The collected metric is then used by Cloud Monitoring to create dashboards, alerts, uptime checks, and notifications to drive observability for workloads running in your application. You can configure the Ops Agent to monitor many third-party applications. For a detailed list, refer to the documentation. The Ops Agent supports most major operating systems, such as CentOS, Ubuntu, and Windows. If the standard metrics provided by Cloud Monitoring do not fit your needs, you can create custom metrics. For example, imagine a game server that has a capacity of 50 users. What metric indicator might you use to trigger scaling events? From an infrastructure perspective, you might consider using CPU load or perhaps network traffic load as values that are somewhat correlated with the number of users. But with a custom metric, you could actually pass the current number of users directly from your application into Cloud Monitoring. To get started with creating custom metrics, please refer to the documentation. When you want to maintain a metric at a target value, specify a utilization target. The autoscaler creates VMs when the metric value is above the target and deletes VMs when the metric value is below the target. If the metric comes from each VM in your managed instance group, then the autoscaler takes the average metric value across all VMs in the managed instance group and compares it with the utilization target. If the metric applies to the whole managed instance group and does not come from the VMs in your managed instance group, then the autoscaler compares the metric value with the utilization target. When your metric has multiple values, apply a filter to autoscale using an individual value from the metric.

2. Let's practice!

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.