AI cost management on GKE

1. AI cost management on GKE

AI and ML workloads have unique infrastructure demands that can be met by running containerized workloads on GKE. Enterprises aim to build AI/ML platforms that are not only effective but also cost-efficient. GKE is a flexible platform offering solutions for both managed and unmanaged services. Effective scaling is essential for optimizing cost, ensuring that you only pay for the infrastructure you need.

You can choose from two modes of operation in GKE: GKE Standard and GKE Autopilot. GKE Standard provides advanced configuration options and flexibility for hands-on control. Alternatively, GKE Autopilot offers a hands-free experience where Google manages cluster provisioning, management, and workload deployment, significantly reducing the operational burden and lowering the barrier to entry. GKE Autopilot costs are determined by the workload requirements for memory, CPU, and storage. You don't pay for system Pods, operating system overhead, unallocated space, or unscheduled Pods. This ensures that costs are kept to a minimum when no workloads are running.

AI and ML workloads often require hardware accelerators, which can be a significant expense. There are several GKE innovations available to optimize accelerator usage and reduce operational costs. Time-sharing allows multiple containers to share a single physical GPU. This can be perfect for workloads like notebooks or low-volume inference. Multi-instance GPU allows larger GPUs to be partitioned into several smaller ones to better fit the needs of your model. And Multi-Process Service (MPS) is an NVIDIA technology that provides an additional way for applications to share a GPU concurrently.

Tensor Processing Units, or TPUs, are a cost-efficient, scalable option for a wide range of AI workloads, including training, fine-tuning, and inference. GKE is the first managed Kubernetes offering to support TPUs, providing scalable computing resources.
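To make the multi-instance idea mentioned earlier more concrete, here is a minimal Python sketch of how you might pick the smallest GPU partition that still fits a model. The partition table, the 20% memory headroom, the function names, and the hourly price are all assumptions for illustration only; check the documentation for your GPU model before relying on any of these values.

```python
from typing import Optional

# Illustrative partition profiles for a 40 GB GPU:
# profile name -> (memory in GB, how many such partitions fit on one GPU).
# These values are assumptions for this sketch, not a GKE API.
PARTITIONS = {
    "1g.5gb": (5, 7),
    "2g.10gb": (10, 3),
    "3g.20gb": (20, 2),
    "7g.40gb": (40, 1),
}


def smallest_fitting_partition(model_memory_gb: float,
                               headroom: float = 1.2) -> Optional[str]:
    """Return the smallest profile whose memory covers the model plus headroom."""
    needed = model_memory_gb * headroom
    candidates = [(mem, name) for name, (mem, _) in PARTITIONS.items()
                  if mem >= needed]
    if not candidates:
        return None  # the model does not fit on a single GPU of this size
    return min(candidates)[1]


def cost_per_partition_hour(partition: str, gpu_hourly_price: float) -> float:
    """Split a (hypothetical) whole-GPU hourly price evenly across partitions."""
    _, count = PARTITIONS[partition]
    return gpu_hourly_price / count


if __name__ == "__main__":
    profile = smallest_fitting_partition(7)       # a model needing about 7 GB
    print(profile)                                # 2g.10gb
    print(cost_per_partition_hour(profile, 3.0))  # 1.0
```

The point of the sketch is the trade-off rather than the numbers: a smaller partition lets more workloads share one physical GPU, so the effective cost per workload drops as long as the model still fits.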
Running TPUs on GKE offers excellent price performance combined with industry-leading scalability. GKE supports technologies that optimize the use of underlying infrastructure. Strategies include reducing container image size, using NVIDIA Triton and FasterTransformer, re-evaluating accelerator sizes to use smaller GPUs for less latency-sensitive workloads, using Spot VMs for up to 90% cost savings, and leveraging Cloud Storage FUSE to expose buckets as locally mounted folders with a file cache. These approaches help minimize infrastructure costs effectively.

Before wrapping up, let's explore Cloud Storage FUSE further. Cloud Storage FUSE offers a direct link to Cloud Storage for model weights stored in object storage buckets. It includes a caching mechanism for frequently read files, which prevents additional downloads from the source bucket and reduces latency. An advantage of Cloud Storage FUSE is that it doesn't require pre-hydration: no extra operational step is needed for a Pod to read new files in a bucket. However, if you switch buckets, you'll need to restart the Pod with an updated Cloud Storage FUSE configuration. To enhance performance, you can enable parallel downloads, which use multiple workers to download a model, significantly improving model pull performance. Cloud Storage FUSE increases speed and lowers the cost of AI and ML training by reducing time spent waiting for data. It improves training time by up to 2.3x and increases throughput by 3.4x. This provides a performance boost for multi-epoch training workflows, and it dramatically improves the speed of small, randomly distributed input/output (I/O) operations.
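The file cache described above can be pictured as a read-through cache: the first read of a file downloads it, and later reads are served from local disk. The Python sketch below illustrates that general idea only. It is not how Cloud Storage FUSE is implemented, and `ReadThroughFileCache`, `fake_bucket_fetch`, and the object name are hypothetical names invented for this example.

```python
import hashlib
import os
import tempfile
from typing import Callable


class ReadThroughFileCache:
    """Serve repeated reads from local disk; contact the source only on a miss."""

    def __init__(self, fetch: Callable[[str], bytes], cache_dir: str) -> None:
        self._fetch = fetch        # hypothetical downloader: object name -> bytes
        self._cache_dir = cache_dir
        self.downloads = 0         # number of times the source was contacted

    def _local_path(self, name: str) -> str:
        # Hash the object name so any name maps to a safe local file name.
        digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
        return os.path.join(self._cache_dir, digest)

    def read(self, name: str) -> bytes:
        path = self._local_path(name)
        if os.path.exists(path):   # cache hit: no download needed
            with open(path, "rb") as f:
                return f.read()
        data = self._fetch(name)   # cache miss: download once, then store
        self.downloads += 1
        with open(path, "wb") as f:
            f.write(data)
        return data


def fake_bucket_fetch(name: str) -> bytes:
    """Stand-in for a real bucket download (assumption for this sketch)."""
    return f"weights for {name}".encode("utf-8")


def demo() -> int:
    """Read the same object three times and report how many downloads occurred."""
    with tempfile.TemporaryDirectory() as tmp:
        cache = ReadThroughFileCache(fake_bucket_fetch, tmp)
        for _ in range(3):         # e.g. three epochs reading the same file
            cache.read("models/example/weights.bin")
        return cache.downloads


if __name__ == "__main__":
    print(demo())  # 1
```

Three reads result in a single download, which is why a file cache tends to help multi-epoch training: every epoch after the first reads the same data locally instead of waiting on the bucket.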

2. Let's practice!
