Kubernetes Image Pull Optimisation [Part I — Exploring Options]

NIRAV SHAH
9 min read · Oct 24, 2023

Introduction

In the ever-evolving world of container orchestration and Kubernetes, efficient image management is crucial for smooth deployments. Kubernetes applications rely heavily on container images, and as these applications grow, so does the need for effective image caching solutions. In this article, we’ll delve into Kubernetes image cache options, with a focus on kube-fledged, Kraken, and Dragonfly, and explore how they can optimize your Kubernetes deployments.

Understanding the Image Pull Problem

By default, the kubelet tries to pull each image from the specified registry. However, if the imagePullPolicy property of the container is set to IfNotPresent or Never, then a local image is used (preferentially or exclusively, respectively).
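As a minimal illustration, here is a pod spec that opts into the local cache (the image name and tag are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
  - name: web
    image: nginx:1.25              # placeholder image and tag
    imagePullPolicy: IfNotPresent  # reuse the node's local copy when present
```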

Node groups pull images on demand from a public or private repository

Pulling an image is one of the most time-consuming steps in the container lifecycle. Research shows that the pull operation accounts for 76% of container startup time [FAST ’16].

In our case, we found the container startup times below. The times shown are averages over 20 different container images.

Image pull consumes the majority of boot-up time

We can observe above that the majority of the time spent starting a container goes to the pod image pull. The best case occurs when the pod image is already available locally on the node. The worst case depends on the image size and the network bandwidth available to the node: although we list 120 seconds as the worst time, it can be larger for 5–6 GB images.

Let’s try to optimise the overall boot-up time of the pod. Node provisioning and image pull are the two components where we can optimise performance. All remaining events depend on application code, and we have to allow time for that code to run.

Optimise Node Provisioning

Using Karpenter for node provisioning instead of a traditional ASG (Auto Scaling group) on EKS improves performance: an ASG takes roughly 2 minutes to provision a node, whereas Karpenter nodes are available in about 30 seconds.

Well, can we reduce the node provisioning time to zero? Yes, we can, by over-provisioning nodes with low-priority placeholder pods. I have detailed this in my blog “Enhancing Kubernetes Scalability and Responsiveness with Pod Priority and Over-Provisioning — Terraform”. A minimal sketch of the pattern follows.
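This is a sketch only, assuming a negative-priority PriorityClass and a placeholder Deployment of pause containers; the replica count and resource requests are illustrative and should be sized to the headroom you want:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1             # below the default priority of 0, so real workloads preempt these pods
globalDefault: false
description: "Placeholder pods that keep spare node capacity warm"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2         # illustrative; one placeholder per spare node you want
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "1"        # reserve roughly one workload pod's worth of capacity
            memory: 1Gi
```

When a real pod arrives, it preempts a placeholder pod and starts immediately; the evicted placeholder then triggers provisioning of a replacement node in the background.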

Summary

As detailed above, it is important for us to optimise the image pull. The main concept is the use of an image caching mechanism.

The Importance of Image Caching

Before we dive into specific solutions, let’s understand why image caching is a critical component in Kubernetes:

  • Faster Deployments: Caching images reduces the time required for pods to start, resulting in quicker application deployments.
  • Network and Resource Efficiency: By storing images locally, you minimize the need to fetch them from remote registries, saving bandwidth and resources.
  • Enhanced Availability: Caching improves application reliability since you can rely on locally available images, even if the remote registry experiences downtime.
  • Docker Hub API limits: public registries such as Docker Hub enforce well-defined pull rate limits. If a cluster creates too many pods, further pull requests can be blocked by the limit, eventually causing downtime.

Options for optimising Image Pull

There are a few options for optimising the image pull. We will go through them in order, from least benefit to greatest benefit.

Option 1: Pre-pulled Images

Setting imagePullPolicy to IfNotPresent (as in the pod spec sketch above) allows the node’s cached image to be used. So if a new pod with the same image is scheduled on the same node, it takes nearly 0 seconds to pull the image. It is equivalent to a docker pull on your local machine: once done, all subsequent docker runs are instantaneous.

Pod 1 pulls the image from the registry and stores it locally on the node; Pod 2 uses the image stored locally.

Cons: on a busy Kubernetes cluster, scheduling pods with the same image onto the same node is rare.

Option 2: Parallel Image pulls

When serializeImagePulls is set to false, the kubelet defaults to no limit on the maximum number of images being pulled at the same time. If you would like to limit the number of parallel image pulls, you can set the field maxParallelImagePulls in kubelet configuration. With maxParallelImagePulls set to n, only n images can be pulled at the same time, and any image pull beyond n will have to wait until at least one ongoing image pull is complete.
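In kubelet configuration terms, that looks like the following (maxParallelImagePulls requires Kubernetes v1.27 or later):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
serializeImagePulls: false   # allow images to be pulled in parallel
maxParallelImagePulls: 3     # cap concurrent pulls; omit for no limit
```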

Multiple images pulled in parallel

Cons: maxParallelImagePulls is only available from Kubernetes v1.27 [alpha]. Also, for a pod with multiple containers, the images are still pulled sequentially even when this parameter is configured.

Option 3: Lazy-loading container images / Image streaming

Lazy loading is an approach where data is downloaded from the registry in parallel with the application startup.

Harter et al. [FAST ’16] found that image download accounts for 76% of container startup time, but on average only 6.4% of the fetched data is actually needed for the container to start doing useful work.

Usual startup of container images without lazy load enabled
Startup of container images with lazy load enabled

Software that provides lazy loading:

Seekable OCI (SOCI) is a technology open-sourced by AWS that enables containers to launch faster by lazily loading the container image.

The stargz-snapshotter is a remote snapshotter plugin for containerd that lazily pulls container images in the eStargz format, so containers can start before the full image has been downloaded. It aims to reduce the data fetched at startup and improve container startup times, making it a valuable addition to Kubernetes environments where these optimizations are critical.

GKE (Google Kubernetes Engine) Image streaming is a feature that enables you to quickly provision and deploy container images from a private container registry directly to your GKE cluster’s nodes. It improves the speed and efficiency of deploying container images by streaming image data to the nodes on demand instead of pulling the entire image from the registry before running the containers.

These options require a good understanding of how image streaming / lazy loading is implemented; I have dedicated a separate Medium article to this. Stay tuned, the link will be available soon.

Cons: lazy loading does not work with public registries. Although ECR and Google Artifact Registry support lazy loading, it requires specific build steps/tools and configuration. Hence it requires rebuilding images or a one-time migration effort along with this configuration.

Option 4: Prefetch Images

Prefetching images, in the context of containerization and Kubernetes, refers to the practice of proactively pulling container images from a container registry or repository before they are needed to run containers. This is done in advance, typically during cluster initialization or setup, so that when a pod is scheduled to run, the required image is already available locally.
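A common way to implement prefetching without extra tooling is a DaemonSet whose init containers pull the target images on every node and then hand over to a tiny pause container. This is a sketch: the image name is hypothetical, and the init container assumes the image ships a shell:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: image-prefetch
spec:
  selector:
    matchLabels:
      app: image-prefetch
  template:
    metadata:
      labels:
        app: image-prefetch
    spec:
      initContainers:
      - name: warm-myapp
        image: myregistry.example.com/myapp:1.0   # hypothetical image to cache
        command: ["sh", "-c", "exit 0"]           # pulling the image is the real work
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9          # keeps the pod alive at negligible cost
```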

Images are pushed to new and existing nodes whenever an image change occurs

Software that provides image prefetch:

Custom implementation: an AWS EventBridge-based configuration combined with Systems Manager can be used to push images from ECR to node groups.

Kube-fledged is a Kubernetes operator that creates and manages a cache of container images on the worker nodes of a Kubernetes cluster. It allows users to define a list of images and which worker nodes those images should be cached on.
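With kube-fledged, that list is declared through its ImageCache custom resource. A sketch, assuming the v1alpha2 API, a hypothetical image, and a hypothetical node label (check the schema of your installed version):

```yaml
apiVersion: kubefledged.io/v1alpha2
kind: ImageCache
metadata:
  name: imagecache-sample
  namespace: kube-fledged
spec:
  cacheSpec:
  - images:
    - myregistry.example.com/myapp:1.0   # hypothetical image to keep warm
    nodeSelector:                        # optional: cache only on matching nodes
      node-type: app
```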

These options require a good understanding of how image prefetch is implemented; I have dedicated a separate Medium article to this as well. Stay tuned, the link will be available soon.

Cons: avoid caching too many images, as this may fill node storage and cause cluster stability issues.

This solution also assumes that the image pull occurs before a Pod gets scheduled on the node. If the Pod is scheduled before the image is cached, then the node must pull the image from the container registry, thus rendering this technique ineffective.

Option 5: Pull-through cache

Container images are stored in registries and pulled into environments where they run. There are many different types of registries from private, self-run registries to public, unauthenticated registries. The registry you use is a direct dependency that can have an impact on how fast you can scale, the security of the software you run, and the availability of your systems. For production environments, it’s recommended that customers limit external dependencies that impact these areas and host container images in a private registry.

AWS ECR supports a pull-through cache for Quay and registry.k8s.io. It is recommended to change the image registry path for each deployment, either by manually updating the image reference or by using the https://kyverno.io/ policy manager, as sketched below.
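Here is a sketch of such a Kyverno mutation, modelled on Kyverno’s published “replace image registry” sample policy. The ECR URL is a placeholder, and the exact syntax should be checked against your Kyverno version:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: replace-image-registry
spec:
  rules:
  - name: replace-image-registry
    match:
      any:
      - resources:
          kinds:
          - Pod
    mutate:
      foreach:
      - list: "request.object.spec.containers"
        patchStrategicMerge:
          spec:
            containers:
            - name: "{{ element.name }}"
              # rewrite the registry host to the pull-through cache (placeholder account/region)
              image: "{{ regex_replace_all_literal('^[^/]+', '{{element.image}}', '111122223333.dkr.ecr.us-east-1.amazonaws.com') }}"
```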

AWS ECR supports Quay and registry.k8s.io as pull-through cache sources

Cons: this option improves performance by caching images from an external registry in a private repository, preferably located in the same region as your cluster. It does not improve the performance of large images already hosted in a private registry.

Option 6: Kubernetes Native container registry

Today, most container registries run independently of the clusters that run the containers built from their images. However, running a registry within the cluster itself could offer many advantages, including faster boot times, better auditing, and more control over the namespace.

Trow: the primary goal of Trow is to create a registry that runs within Kubernetes and provides a secure and fast way to get containers running on the cluster.

Harbor is an open-source container image registry and management platform for securing and managing container images in Kubernetes and Docker environments. Harbor was initially developed by VMware and is now part of the Cloud Native Computing Foundation (CNCF) landscape. Harbor replication can provide the image caching effect.
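Pointing workloads at the in-cluster registry is then just a change of image reference; the registry host below is hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
  - name: myapp
    image: harbor.internal.example.com/library/myapp:1.0   # hypothetical in-cluster registry path
```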

Cons: a duplicate image registry needs to be maintained. We may have to change the image CI pipeline to publish to it, and the pod image source needs to be modified, as shown above.

Option 7: P2P-based Image distribution

P2P-based image distribution refers to the distribution or sharing of images using a Peer-to-Peer (P2P) network architecture. In a P2P network, participants, known as peers, act as both suppliers and consumers of resources. Unlike traditional client-server architectures, P2P networks do not rely on a central server or authority to control the distribution process. Instead, each peer has equal privileges and can directly communicate and interact with other peers in the network.

Kraken, open-sourced by Uber, is a robust image caching solution for Kubernetes, known for its scalability and extensive feature set. It goes beyond basic caching, providing advanced options for image distribution across clusters.

Dragonfly is a unique Kubernetes image cache option that leverages P2P technology for image distribution. It can cache and distribute images across nodes using a peer-to-peer network.

These options require a good understanding of how P2P Docker image distribution works; I have dedicated a separate Medium article to this as well. Stay tuned, the link will be available soon.

Cons: system design and debugging become more complex.

Conclusion

Efficient image caching is essential for optimizing Kubernetes deployments, and the right choice of image cache options can make a significant difference. The choice between options depends on your specific use case, expertise, and infrastructure scale. By implementing an image caching solution that fits your requirements, you can ensure faster, more reliable, and resource-efficient Kubernetes deployments.

Choose your Kubernetes image cache option wisely, and watch your deployments become smoother and more responsive than ever before. We can also combine multiple options to gain even more benefit.

Next

Now we have explored various options for image caching. Before going into the details of each tool, we should explore how to measure image startup performance; follow my next blog. Please leave a comment if I have missed any image pull optimisation options.

Reference:

[FAST ’16] Tyler Harter et al., “Slacker: Fast Distribution with Lazy Docker Containers”, USENIX FAST 2016.
