3 posts tagged with "cncf"

Cloud-Native AI Model Management and Distribution for Inference Workloads

· 12 min read

Author:

  • Wenbo Qi (Gaius), Dragonfly/ModelPack Maintainer
  • Chenyu Zhang (Chlins), Harbor/ModelPack Maintainer
  • Feynman Zhou, ORAS Maintainer, CNCF Ambassador

Reviewer:

  • Sascha Grunert, CRI-O Maintainer
  • Wei Fu, containerd Maintainer

The weight of AI models: Why infrastructure always arrives slowly

As AI adoption accelerates across industries, organizations face a critical bottleneck that is often overlooked until it becomes a serious obstacle: reliably managing and distributing large model weight files at scale. A model's weights serve as the central artifact that bridges both training and inference pipelines — yet the infrastructure surrounding this artifact is frequently an afterthought.

This article addresses the operational challenges of managing AI model artifacts at enterprise scale, and introduces a cloud-native solution that brings software delivery best practices (versioning, immutability, and GitOps) to the world of large model files.

The gap nobody talks about — until it breaks production

The cloud native gap: Most existing ML model storage approaches were not designed with Kubernetes-native delivery in mind, leaving a critical gap between how software artifacts are managed and how model artifacts are managed.

Today, enterprises operate AI infrastructure on Kubernetes, yet their model artifact management lags behind. Software containers are pulled from OCI registries with full versioning, security scanning, and rollback support. Model weights, by contrast, are often downloaded via ad hoc scripts, copied manually between storage buckets, or distributed through unsecured shared filesystems. This gap creates deployment fragility, security risks, and operational overhead at scale.

When your model weighs more than your entire app

Modern foundation models are not small. A single model checkpoint can range from tens of gigabytes to several terabytes. For reference, a quantized LLaMA-3 70B model weighs approximately 140 GB, while frontier multimodal models can easily exceed 1 TB. These are not files you version-control with standard Git — they demand dedicated storage strategies, efficient transfer protocols, and careful access control.

The core challenges are: storage at scale, distribution speed, and reproducibility. Teams need to store multiple model versions, rapidly distribute them to GPU inference nodes across regions, and guarantee that any deployment can be traced back to an exact, immutable artifact.
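A back-of-the-envelope sketch makes the distribution challenge concrete. The 140 GB figure comes from the paragraph above; the 10 Gbps origin uplink and 500-node fleet are illustrative assumptions, not measurements:

```python
# Illustrative: how long does naive distribution of a 140 GB model take?
MODEL_BYTES = 140 * 10**9   # ~140 GB quantized LLaMA-3 70B checkpoint
LINK_BPS = 10 * 10**9       # assumed 10 Gbps origin uplink
NODES = 500                 # assumed inference fleet size

one_node_s = MODEL_BYTES * 8 / LINK_BPS   # one download saturating the link
all_nodes_h = one_node_s * NODES / 3600   # serial fan-out from a single origin

print(f"one node:  {one_node_s:.0f} s")   # 112 s
print(f"500 nodes: {all_nodes_h:.1f} h")  # ~15.6 h from a single origin
```

The single-origin ceiling in the second number is the bottleneck that the P2P distribution described later in this article is designed to break.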

Three paths forward — and why none of them are enough

|  | Git LFS (Hugging Face Hub) | Object Storage (S3, MinIO) | Distributed Filesystem (NFS, CephFS) |
| --- | --- | --- | --- |
| Pros | Native version control (branches, tags, commits, history). | Standard offering from cloud providers. Native support in engines like vLLM/SGLang. | POSIX compatible. Low integration cost. |
| Cons | Poor protocol adaptation for cloud-native environments. Inherits Git's transport inefficiencies and lacks optimizations for huge-file distribution. | Lacks structured metadata. Weak version management capabilities. | Lacks structured metadata. Weak version management capabilities. High operational complexity for a distributed filesystem. |

Rethinking the delivery pipeline: Models deserve better than a shell script

The approach described here treats AI model weights as first-class OCI (Open Container Initiative) artifacts, packaging them in the same container registries used for application images. This enables model delivery to leverage the full ecosystem of container tooling: security scanning, signed provenance, GitOps-driven deployment, and Kubernetes-native pulling.

What if we shipped models the same way we ship code?

In the cloud-native era, developers have long established a mature and efficient paradigm for software delivery.


The software delivery:

  1. Develop: Developers commit code to a Git repository, manage code changes through branches, and define versions using tags at key milestones.
  2. Build: CI/CD pipelines compile and test, packaging the output into an immutable Container Image.
  3. Manage and deliver: Images are stored in a Container Registry. Supply chain security (scanning/signing), RBAC, and P2P distribution ensure safe delivery.
  4. Deploy: DevOps engineers use declarative Kubernetes YAML to define the desired state. The Container's lifecycle is managed by Kubernetes.


The cloud native AI model delivery:

  1. Develop: Algorithm engineers push weights and configs to the Hugging Face Hub, treating it as the Git Repository.
  2. Build: CI/CD pipelines package weights, runtime configurations, and metadata into an immutable Model Artifact.
  3. Manage and deliver: The Model Artifact is managed by an Artifact Registry, reusing the existing container infrastructure and toolchain.
  4. Deploy: Engineers use Kubernetes OCI Volumes or a Model CSI Driver. Models are mounted into the inference Container as Volumes via declarative semantics, decoupling the AI model from the inference engine (vLLM, SGLang, etc.).

By applying software delivery paradigms and supply chain thinking to model lifecycle management, we constructed a granular, efficient system that resolves the challenges of managing and distributing AI models in production.

Walking the pipeline: A build story in four steps

Build

modctl is a CLI tool designed to package AI models into OCI artifacts. It standardizes versioning, storage, distribution, and deployment, ensuring integration with the cloud-native ecosystem.


Step 1: Auto-generate Modelfile

Run the following in the model directory to generate a definition file.

modctl modelfile generate .
Step 2: Customize Modelfile

You can also customize the content of the Modelfile.

# Model name (string), such as llama3-8b-instruct, gpt2-xl, qwen2-vl-72b-instruct, etc.
NAME qwen2.5-0.5b

# Model architecture (string), such as transformer, cnn, rnn, etc.
ARCH transformer

# Model family (string), such as llama3, gpt2, qwen2, etc.
FAMILY qwen2

# Model format (string), such as onnx, tensorflow, pytorch, etc.
FORMAT safetensors

# Specify model configuration file, support glob path pattern.
CONFIG config.json

# Specify model configuration file, support glob path pattern.
CONFIG generation_config.json

# Model weight, support glob path pattern.
MODEL *.safetensors

# Specify code, support glob path pattern.
CODE *.py
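The MODEL, CONFIG, and CODE directives above accept glob path patterns. As an illustration of how such patterns select files from a model directory (using Python's fnmatch here purely as a stand-in; this is not modctl's internal implementation):

```python
from fnmatch import fnmatch

# A hypothetical model directory listing.
files = [
    "config.json", "generation_config.json",
    "model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors",
    "modeling_qwen.py", "README.md",
]

weights = [f for f in files if fnmatch(f, "*.safetensors")]  # MODEL *.safetensors
code = [f for f in files if fnmatch(f, "*.py")]              # CODE *.py

print(weights)  # both weight shards
print(code)     # ['modeling_qwen.py']
```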
Step 3: Login to Artifact Registry (Harbor)
modctl login -u username -p password harbor.registry.com
Step 4: Build OCI Artifact
modctl build -t harbor.registry.com/models/qwen2.5-0.5b:v1 -f Modelfile .

A Model Manifest is generated after the build. Descriptive information such as ARCH, FAMILY, and FORMAT is stored in a file with the media type application/vnd.cncf.model.config.v1+json.

{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "artifactType": "application/vnd.cncf.model.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.cncf.model.config.v1+json",
    "digest": "sha256:d5815835051dd97d800a03f641ed8162877920e734d3d705b698912602b8c763",
    "size": 301
  },
  "layers": [
    {
      "mediaType": "application/vnd.cncf.model.weight.v1.raw",
      "digest": "sha256:3f907c1a03bf20f20355fe449e18ff3f9de2e49570ffb536f1a32f20c7179808",
      "size": 4294967296
    },
    {
      "mediaType": "application/vnd.cncf.model.weight.v1.raw",
      "digest": "sha256:6d923539c5c208de77146335584252c0b1b81e35c122dd696fe6e04ed03d7411",
      "size": 5018536960
    },
    {
      "mediaType": "application/vnd.cncf.model.weight.config.v1.raw",
      "digest": "sha256:a5378e569c625f7643952fcab30c74f2a84ece52335c292e630f740ac4694146",
      "size": 106
    },
    {
      "mediaType": "application/vnd.cncf.model.weight.code.v1.raw",
      "digest": "sha256:15da0921e8d8f25871e95b8b1fac958fc9caf453bad6f48c881b3d76785b9f9d",
      "size": 394
    },
    {
      "mediaType": "application/vnd.cncf.model.doc.v1.raw",
      "digest": "sha256:5e236ec37438b02c01c83d134203a646cb354766ac294e533a308dd8caa3a11e",
      "size": 23040
    }
  ]
}
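Because the manifest is plain JSON, tooling can inspect an artifact's contents and size without downloading a single weight layer. A small sketch that totals layer sizes from a trimmed copy of the manifest above:

```python
import json

# A trimmed copy of the manifest shown above (three layers kept for brevity).
manifest = json.loads("""{
  "layers": [
    {"mediaType": "application/vnd.cncf.model.weight.v1.raw", "size": 4294967296},
    {"mediaType": "application/vnd.cncf.model.weight.v1.raw", "size": 5018536960},
    {"mediaType": "application/vnd.cncf.model.weight.config.v1.raw", "size": 106}
  ]
}""")

total = sum(layer["size"] for layer in manifest["layers"])
weight_bytes = sum(layer["size"] for layer in manifest["layers"]
                   if layer["mediaType"] == "application/vnd.cncf.model.weight.v1.raw")

print(f"artifact payload: {total / 2**30:.2f} GiB")
print(f"weights only:     {weight_bytes / 2**30:.2f} GiB")
```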
Step 5: Push
modctl push harbor.registry.com/models/qwen2.5-0.5b:v1

Management

Current AI infrastructure workflows focus heavily on model distribution performance while often ignoring model management standards. Manual copying works for experiments, but at production scale the lack of unified versioning, metadata specifications, and lifecycle management becomes a liability. As the standard cloud-native Artifact Registry, Harbor is ideally suited for model storage, treating models as inference artifacts.

Harbor standardizes AI model management through:

  • Versioning: Models are OCI Artifacts with immutable Tags and SHA-256 Digests, guaranteeing deterministic inference environments. Harbor also visually presents a model's basic attributes, parameter configuration, display information, and file list, which reduces the risk of deploying unknown versions and makes the model fully transparent.


  • RBAC: Fine-grained access control. Control who can PUSH (e.g., Algorithm Engineers), who can only PULL (e.g., Inference Services), and who has administrative privileges.


  • Lifecycle management: Tag retention policies automatically purge non-release versions while locking active versions, balancing storage costs with stability.


  • Supply chain security: Integration with Cosign/Notation for signing. Harbor enforces signature verification before distribution, preventing model poisoning attacks.


  • Replication: Automated, incremental synchronization between central and edge registries or active-standby clusters.


  • Audit: Comprehensive logging of all artifact operations (pull/push/delete) for security compliance and traceability.

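The digest-based immutability behind Harbor's versioning can be sketched in a few lines: any change to an artifact's bytes yields a different SHA-256, so a digest reference can never silently point at different content (illustrative, using Python's hashlib; the config bytes are made up):

```python
import hashlib

# Two hypothetical versions of a model config blob.
config_v1 = b'{"name": "qwen2.5-0.5b", "format": "safetensors"}'
config_v2 = b'{"name": "qwen2.5-0.5b", "format": "pytorch"}'

d1 = "sha256:" + hashlib.sha256(config_v1).hexdigest()
d2 = "sha256:" + hashlib.sha256(config_v2).hexdigest()

print(d1)
assert d1 != d2  # changed bytes -> a different artifact identity
# Pulling by digest (e.g. .../qwen2.5-0.5b@sha256:...) is therefore
# deterministic, unlike a mutable tag such as :latest.
```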

Delivery

Downloading terabyte-sized model weights directly from the origin introduces bandwidth bottlenecks. We utilize Dragonfly for P2P-based distribution, integrated with Harbor for preheating.


Dragonfly P2P-based distribution

For large-scale distribution scenarios, Dragonfly is deeply optimized around P2P technology. Take 500 nodes downloading a 1 TB model: the system spreads the initial download tasks for different layers across nodes to maximize downstream bandwidth utilization and avoid single-point congestion, and a secondary bandwidth-aware scheduling algorithm dynamically adjusts download paths to eliminate network hotspots and long-tail latency. For an individual weight file, Dragonfly splits it into pieces and fetches them concurrently from the origin. This enables streaming-based downloading, allowing peers to share a model before the complete file has arrived. The solution has been proven in high-performance AI clusters, utilizing 70%-80% of each node's bandwidth and significantly improving model deployment efficiency.

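The piece-splitting described above can be sketched as follows (the 4 MiB piece size is an illustrative assumption, not Dragonfly's exact configuration):

```python
def split_into_pieces(total_size: int, piece_size: int = 4 * 2**20):
    """Return (start, end) byte ranges suitable for HTTP Range requests."""
    return [(off, min(off + piece_size, total_size) - 1)
            for off in range(0, total_size, piece_size)]

pieces = split_into_pieces(10 * 2**20)  # a 10 MiB blob
print(len(pieces))                       # 3 pieces: 4 MiB + 4 MiB + 2 MiB
print(pieces[-1])                        # (8388608, 10485759)
```

Peers can then fetch different pieces from different sources concurrently, and re-share each completed piece before the whole file has been assembled.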

Preheating

For latency-sensitive inference services, Harbor triggers Dragonfly to distribute and cache data on target nodes before the service scales out. When an instance starts, the model loads from local disk, eliminating network fetches on the startup path.


Deployment

Deployment focuses on decoupling the Model (Data) from the Inference Engine (Compute). By leveraging Kubernetes declarative primitives, the Engine runs as a Container, while the Model is mounted as a Volume. This native approach not only enables multiple Pods on the same node to share and reuse the model, saving disk space, but also leverages the preheating and P2P capabilities of Harbor & Dragonfly to eliminate the latency of pulling large model weights, significantly improving startup speed.


OCI Volumes (Kubernetes 1.31+)

Native support for mounting OCI artifacts as volumes via CRI-O/containerd. This feature was introduced as alpha in Kubernetes 1.31 (requires enabling the ImageVolume feature gate) and promoted to beta in Kubernetes 1.33 (enabled by default, no feature gate configuration needed). CRI-O further optimizes this for LLMs by storing layers uncompressed, which avoids decompression overhead at mount time and yields superior performance when mounting large model files.

Step 1: Build YAML
apiVersion: v1
kind: Pod
metadata:
  name: vllm-cpu-inference
  labels:
    app: vllm
spec:
  containers:
    - name: vllm
      image: openeuler/vllm-cpu:latest
      command:
        - 'python3'
        - '-m'
        - 'vllm.entrypoints.openai.api_server'
      args:
        - '--model'
        - '/models'
        - '--dtype'
        - 'float32'
        - '--host'
        - '0.0.0.0'
        - '--port'
        - '8000'
        - '--max-model-len'
        - '1024'
        - '--disable-log-requests'
      env:
        - name: VLLM_CPU_KVCACHE_SPACE
          value: '1'
        - name: VLLM_WORKER_MULTIPROC_METHOD
          value: 'spawn'
      resources:
        requests:
          memory: '2Gi'
          cpu: '1'
        limits:
          memory: '16Gi'
          cpu: '8'
      volumeMounts:
        - name: model-volume
          mountPath: /models
          readOnly: true
      ports:
        - containerPort: 8000
          protocol: TCP
          name: http
      livenessProbe:
        httpGet:
          path: /health
          port: 8000
        initialDelaySeconds: 60
        periodSeconds: 10
        timeoutSeconds: 5
      readinessProbe:
        httpGet:
          path: /health
          port: 8000
        initialDelaySeconds: 30
        periodSeconds: 5
  volumes:
    - name: model-volume
      image:
        reference: ghcr.io/chlins/qwen2.5-0.5b:v1
        pullPolicy: IfNotPresent
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
    - port: 8000
      targetPort: 8000
      protocol: TCP
      name: http
  type: ClusterIP
Step 2: Deploy Inference Workload


Step 3: Call Inference Workload


Model CSI Driver

For compatibility with Kubernetes 1.31 and older, we offer the Model CSI Driver as an interim solution to mount and deploy models as volumes. As OCI Volumes are slated for GA in Kubernetes 1.36, shifting to native OCI Volumes is recommended for the long term.

Step 1: Build YAML
apiVersion: v1
kind: Pod
metadata:
  name: vllm-cpu-inference
  labels:
    app: vllm
spec:
  containers:
    - name: vllm
      image: openeuler/vllm-cpu:latest
      command:
        - 'python3'
        - '-m'
        - 'vllm.entrypoints.openai.api_server'
      args:
        - '--model'
        - '/models'
        - '--dtype'
        - 'float32'
        - '--host'
        - '0.0.0.0'
        - '--port'
        - '8000'
        - '--max-model-len'
        - '1024'
        - '--disable-log-requests'
      env:
        - name: VLLM_CPU_KVCACHE_SPACE
          value: '1'
        - name: VLLM_WORKER_MULTIPROC_METHOD
          value: 'spawn'
      resources:
        requests:
          memory: '2Gi'
          cpu: '1'
        limits:
          memory: '16Gi'
          cpu: '8'
      volumeMounts:
        - name: model-volume
          mountPath: /models
          readOnly: true
      ports:
        - containerPort: 8000
          protocol: TCP
          name: http
      livenessProbe:
        httpGet:
          path: /health
          port: 8000
        initialDelaySeconds: 60
        periodSeconds: 10
        timeoutSeconds: 5
      readinessProbe:
        httpGet:
          path: /health
          port: 8000
        initialDelaySeconds: 30
        periodSeconds: 5
  volumes:
    - name: model-volume
      csi:
        driver: model.csi.modelpack.org
        volumeAttributes:
          model.csi.modelpack.org/reference: ghcr.io/chlins/qwen2.5-0.5b:v1
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
    - port: 8000
      targetPort: 8000
      protocol: TCP
      name: http
  type: ClusterIP
Step 2: Deploy Inference Workload


Step 3: Call Inference Workload


Future

  • Enhanced Preheating: Allow models to be preheated to specified nodes, and allow querying cache distribution across nodes for model-aware pod scheduling.
  • Dragonfly RDMA Acceleration: Enable Dragonfly to utilize InfiniBand or RoCE to improve the speed of distribution.
  • Lazy Loading: Implement on-demand downloading of model weights to reduce startup latency.
  • containerd Optimization: Enhance the OCI Volumes implementation to reduce decompression overhead for large layers.
  • Model Security Scanning: Introduce deep scanning capabilities specifically designed for model weights to detect embedded malicious payloads.


Cloud Native Computing Foundation Announces Dragonfly’s Graduation

· 11 min read

Dragonfly graduates after demonstrating production readiness, powering container and AI workloads at scale.

Key Highlights:

  • Dragonfly graduates from CNCF after demonstrating production readiness and widespread adoption across container and AI workloads.
  • Dragonfly is used by major organizations, including Ant Group, Alibaba, Datadog, DiDi, and Kuaishou to power large-scale container and AI model distribution.
  • Since joining the CNCF, Dragonfly has recorded over 3,000% growth in code contributions and a growing contributor community spanning over 130 companies.

SAN FRANCISCO, Calif. – January 14, 2026 – The Cloud Native Computing Foundation® (CNCF®), which builds sustainable ecosystems for cloud native software, today announced the graduation of Dragonfly, a cloud native open source image and file distribution system designed to solve cloud native image distribution in Kubernetes-centered applications.

“Dragonfly’s graduation reflects the project’s maturity, broad industry adoption and critical role in scaling cloud native infrastructure,” said Chris Aniszczyk, CTO, CNCF. “It’s especially exciting to see the project’s impact in accelerating image distribution and meeting the data demands of AI workloads. We’re proud to support a community that continues to push forward scalable, efficient and open solutions.”

Dragonfly’s Technical Capabilities

Dragonfly delivers efficient, stable, and secure data distribution and acceleration powered by peer-to-peer (P2P) technology. It aims to provide a best‑practice, standards‑based solution for cloud native architectures to improve large‑scale delivery of files, container images, OCI artifacts, AI models, caches, logs, and dependencies.

Dragonfly runs on Kubernetes and is installed via Helm, with its official chart available on Artifact Hub. It integrates tools like Prometheus for tracking performance, OpenTelemetry for collecting and sharing telemetry data, and gRPC for fast communication between components, and it enhances Harbor's capability to distribute images and OCI artifacts through the preheat feature. In the GenAI era, as model serving becomes increasingly important, Dragonfly delivers even more value by distributing AI model artifacts defined by the ModelPack specification.

Dragonfly continues to advance container image distribution, supporting tens of millions of container launches per day in production, saving storage bandwidth by up to 90%, and reducing launch time from minutes to seconds, with large-scale adoption across different cloud native scenarios.

Dragonfly is also driving standards and acceleration solutions for distributing both AI model weights and optimized image layout in AI workloads. The technology reduces data loading for large-scale AI applications and enables the distribution of model weights at a hundred-terabyte scale to hundreds of nodes in minutes. As AI continues to integrate into operations, Dragonfly becomes crucial to powering large-scale AI workloads.

Milestones Driving Graduation

Dragonfly was open-sourced by Alibaba Group in November 2017. It then joined the CNCF as a Sandbox project in October 2018. During this stage, Dragonfly 1.0 became production-ready in November 2019 and the Dragonfly subproject, Nydus, was open-sourced in January 2020. Dragonfly then reached Incubation phase in April 2020, with Dragonfly 2.0 later released in 2021.

Since then, the community has significantly matured and attracted hundreds of contributors from organizations such as Ant Group, Alibaba Cloud, ByteDance, Kuaishou, Intel, Datadog, Zhipu AI, and more, who use Dragonfly to deliver efficient image and AI model distribution.

Since joining CNCF, contributors have increased by 500%, from 45 individuals across 5 companies to 271 individuals across over 130 companies. Commit activity has grown by over 3,000%, from roughly 800 to 26,000 commits, and the number of overall participants has reached 1,890.

What’s Next For Dragonfly

Dragonfly will accelerate AI model weight distribution based on RDMA, improving throughput and reducing end-to-end latency. It will also optimize image layout to reduce data loading time for large-scale AI workloads. A load-aware two-phase scheduling mechanism will be introduced, leveraging collaboration between the scheduler and clients to enhance overall distribution efficiency. To provide more stable and reliable services, Dragonfly will support automatic updates and fault recovery, ensuring stable operation of all components during traffic bursts while controlling back-to-source traffic.

Dragonfly’s Graduation Process

To officially graduate from Incubation status, the Dragonfly team enhanced the election policy, clarified the maintainer lifecycle, standardized the contribution process, defined the community ladder, and added community guidelines for subprojects. The graduation process is supported by CNCF’s Technical Oversight Committee (TOC) sponsors for Dragonfly, Karena Angell and Kevin Wang, who conducted a thorough technical due diligence with Dragonfly’s project maintainers.

Additionally, a third-party security audit of Dragonfly was conducted. The Dragonfly team, with the guidance of their TOC sponsors, completed both a self-assessment and a joint assessment with CNCF TAG Security, then collaborated with the Dragonfly security team on a threat model. After this, the team improved the project's security policy.

Learn more about Dragonfly and join the community: https://d7y.io/

Supporting Quotes

“I am thrilled, as the founder of Dragonfly, to announce its graduation from the CNCF. We are grateful to each and every open source contributor in the community, whose tenacity and commitment have enabled Dragonfly to reach its current state. Dragonfly was created to resolve Alibaba Group’s challenges with ultra-large-scale file distribution and was open-sourced in 2017. Looking back on this journey over the past eight years, every step has embodied the open source spirit and the tireless efforts of the many contributors. This graduation marks a new starting point for Dragonfly. I hope that the project will embark on a new journey, continue to explore more possibilities in the field of data distribution, and provide greater value!”

—Zuozheng Hu, founder of Dragonfly, emeritus maintainer

“I am delighted that Dragonfly is now a CNCF graduated project. This is a significant milestone, reflecting the maturity of the community, the trust of end users, and the reliability of the service. In the future, with the support of CNCF, the Dragonfly team will work together to drive the community’s sustainable growth and attract more contributors. Facing the challenges of large-scale model distribution and data distribution in the GenAI era, our team will continue to explore the future of data distribution within the cloud native ecosystem.”

—Wenbo Qi (Gaius), core Dragonfly maintainer

“Since open-sourcing in 2020, Nydus, alongside Dragonfly, has been validated at production scale. Dragonfly’s graduation is a key milestone for Nydus as a subproject, allowing the project to continue improving the image filesystem’s usability and performance. It will also allow us to further explore ecosystem standardization and AGI use cases that will advance the underlying infrastructure.”

—Song Yan, core Nydus maintainer

“The combination of Dragonfly and Nydus substantially shortens launch times for container images and AI models, enhancing system resilience and efficiency.”

—Jiang Liu, Nydus maintainer

“Thanks to the community’s collective efforts, Dragonfly has evolved from a tool for accelerating container images into a secure and stable distribution system widely adopted by many enterprises. Continuous improvements in usability and stability enable the project to support a variety of scenarios, including CI/CD, edge computing, and AI. New challenges are emerging for the distribution of model weights and data in the age of AI. Dragonfly is becoming a key infrastructure in mitigating these challenges. With the support of the CNCF, Dragonfly will continue to drive the future evolution of cloud native distribution technologies.”

—Yuan Yang, Dragonfly maintainer

TOC Sponsors

“We’re grateful to the TOC members who dedicated significant time to the technical due diligence required for Dragonfly’s advancement, as well as the technical leads and community members who supported governance and technical reviews. We also thank the project maintainers for their openness and responsiveness throughout this process, and the end users who met with TOC and TAB members to share their experiences with Dragonfly. This level of collaboration is what helps ensure the strength and credibility of the CNCF ecosystem.”

—Karena Angell, chair, TOC, CNCF

“Dragonfly’s graduation is a testament to the project’s technical maturity and the community’s consistent focus on performance, reliability, and tangible impact. It’s been impressive to see Dragonfly evolve to meet the needs of large-scale production environments and AI workloads. Congratulations to the maintainers and contributors who’ve worked hard to reach this milestone.”

—Kevin Wang, vice chair, TOC, CNCF

Project End Users

“Over the past few years, as part of Ant Group’s container infrastructure, the Dragonfly project has accelerated container image and code package delivery across several 10K-node Kubernetes clusters. It has significantly saved image transmission bandwidth, and the Nydus subproject has additionally helped us to reduce image pull time to near zero. The project also supports the delivery of large language models within our AI infrastructure. It is a great honor to have contributed to Dragonfly and to have shared our practices with the community.”

—Xu Wang, head of the container infra team, Ant Group, and co-launcher of Kata Containers Project

“Dragonfly has become a key infrastructure component of the container image and data distribution system for Alibaba’s large-scale distributed systems. In ultra-large-scale scenarios such as the Double 11 (Singles’ Day) shopping festival, Dragonfly has provided stable and efficient distribution capabilities and has improved the efficiency of system scaling and delivery. Facing the new technological challenges of the AI era, Dragonfly has played an important role in model data distribution and cache acceleration, helping us to build a more efficient, intelligent computing platform. We are happy to see Dragonfly graduate, which represents an enhancement in community maturity and validates its reliability in large-scale production environments.”

—Li Yi, director of engineering for container service, Alibaba Cloud & Tao Huang, director of engineering for cloud native transformation project, Alibaba Group

“Datadog recently adopted the Dragonfly subproject, Nydus, and it has helped significantly reduce time spent pulling images. This includes AI workloads, where image pulls previously took 5 minutes, and node daemonsets, which have startup speeds directly related to how quickly applications can be scheduled on nodes. We have seen significant improvements using Nydus, now everything starts in a matter of seconds. We are thrilled to see Dragonfly graduate and hope to continue to contribute to this impressive ecosystem!”

—Baptiste Girard-Carrabin, Datadog

“DiDi uses a distributed cloud platform to handle a large number of user requests and quickly adjust resources, which requires very efficient and stable management of resource distribution and image synchronization. Dragonfly is a core component of our technical architecture due to its strong cloud native adaptability, excellent P2P acceleration capabilities, and proven stability in large-scale scenarios. We believe that Dragonfly’s graduation is a strong testament to its technical maturity and industry value. We also look forward to its continued advancement in the field of cloud native distribution, providing more efficient solutions for large-scale file synchronization, image distribution, and other enterprise scenarios.”

—Feng Wu, head of the Elastic Cloud, DiDi & Rapier Yang, head of the Elastic Cloud, DiDi

“At Kuaishou, Dragonfly is considered the cornerstone of our container infrastructure, and it will soon provide stable and reliable image distribution capabilities for tens of thousands of services and hundreds of thousands of servers. Integrated with its subproject Nydus, Dragonfly dramatically enhances application startup efficiency while significantly alleviating disk I/O pressure—ensuring stability for services. In the era of AI large models, Dragonfly also functions as a critical component of our AI infrastructure, providing exceptional acceleration capabilities for large model distribution. We are deeply honored to partner with the vibrant Dragonfly community in collectively exploring future innovations for cloud native distribution technologies.”

—Wang Yao, head of container registry service of Kuaishou

About Cloud Native Computing Foundation

Cloud native computing empowers organizations to build and run scalable applications with an open source software stack in public, private, and hybrid clouds. The Cloud Native Computing Foundation (CNCF) hosts critical components of the global technology infrastructure, including Kubernetes, Prometheus, and Envoy. CNCF brings together the industry’s top developers, end users, and vendors and runs the largest open source developer conferences in the world. Supported by nearly 800 members, including the world’s largest cloud computing and software companies, as well as over 200 innovative startups, CNCF is part of the nonprofit Linux Foundation. For more information, please visit www.cncf.io.

Dragonfly integrates Nydus for image acceleration practice

· 11 min read
Gaius
Dragonfly Maintainer

Introduction

Dragonfly has been selected and put into production use by many Internet companies since it was open-sourced in 2017. It entered the CNCF in October 2018, becoming the third project from China to enter the CNCF Sandbox, and in April 2020 the CNCF TOC voted to accept Dragonfly as a CNCF Incubating project. Through production practice, the community developed the next major version, which absorbs the advantages of Dragonfly 1.x and optimizes away many known problems.

Nydus optimizes the OCIv1 image format with a brand-new image-based filesystem, so that a container can download image data on demand and no longer needs the complete image before starting. In the latest version, Dragonfly has completed its integration with Nydus, allowing containers to start while downloading images on demand and reducing the total amount downloaded. Dragonfly's P2P transmission can also be used during the transfer to reduce back-to-source traffic and increase speed.

Quick start

Prerequisites

| Name | Version | Document |
| --- | --- | --- |
| Kubernetes cluster | 1.20+ | kubernetes.io |
| Helm | 3.8.0+ | helm.sh |
| Containerd | v1.4.3+ | containerd.io |
| Nerdctl | 0.22+ | containerd/nerdctl |

Note: Kind is recommended if no Kubernetes cluster is available for testing.

Install dragonfly

For detailed installation documentation based on kubernetes cluster, please refer to quick-start-kubernetes.

Setup kubernetes cluster

Create kind multi-node cluster configuration file kind-config.yaml, configuration content is as follows:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
    extraPortMappings:
      - containerPort: 30950
        hostPort: 65001
      - containerPort: 30951
        hostPort: 40901
  - role: worker

Create a kind multi-node cluster using the configuration file:

kind create cluster --config kind-config.yaml

Switch the context of kubectl to kind cluster:

kubectl config use-context kind-kind

Kind loads dragonfly image

Pull dragonfly latest images:

docker pull dragonflyoss/scheduler:latest
docker pull dragonflyoss/manager:latest
docker pull dragonflyoss/dfdaemon:latest

Kind cluster loads dragonfly latest images:

kind load docker-image dragonflyoss/scheduler:latest
kind load docker-image dragonflyoss/manager:latest
kind load docker-image dragonflyoss/dfdaemon:latest

Create dragonfly cluster based on helm charts

Create helm charts configuration file charts-config.yaml and enable prefetching, configuration content is as follows:

scheduler:
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

seedPeer:
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066
    download:
      prefetch: true

dfdaemon:
  hostNetwork: true
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066
    download:
      prefetch: true
    proxy:
      defaultFilter: 'Expires&Signature&ns'
      security:
        insecure: true
      tcpListen:
        listen: 0.0.0.0
        port: 65001
      registryMirror:
        dynamic: true
        url: https://index.docker.io
      proxies:
        - regx: blobs/sha256.*

manager:
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066
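One detail in the dfdaemon section is worth calling out: `defaultFilter` lists URL query parameters that dragonfly strips when deriving a task ID, so requests for the same blob that differ only in pre-signed parameters (Expires, Signature, ns) are deduplicated into a single P2P task. A rough, standalone illustration of that normalization (not dragonfly's actual implementation):

```shell
# Strip the filtered query parameters from a request URL, mimicking how
# defaultFilter 'Expires&Signature&ns' normalizes URLs for task deduplication.
url='https://example.com/v2/blobs/sha256/abc?ns=docker.io&Expires=1666&Signature=xyz'
echo "$url" | sed -E 's/(\?|&)(Expires|Signature|ns)=[^&]*//g'
# -> https://example.com/v2/blobs/sha256/abc
```

Two pre-signed URLs for the same layer therefore map to the same task, so the second request can be served from the P2P cache instead of going back to source.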

Create a dragonfly cluster using the configuration file:

$ helm repo add dragonfly https://dragonflyoss.github.io/helm-charts/
$ helm install --wait --create-namespace --namespace dragonfly-system dragonfly dragonfly/dragonfly -f charts-config.yaml
NAME: dragonfly
LAST DEPLOYED: Wed Oct 19 04:23:22 2022
NAMESPACE: dragonfly-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Get the scheduler address by running these commands:
export SCHEDULER_POD_NAME=$(kubectl get pods --namespace dragonfly-system -l "app=dragonfly,release=dragonfly,component=scheduler" -o jsonpath={.items[0].metadata.name})
export SCHEDULER_CONTAINER_PORT=$(kubectl get pod --namespace dragonfly-system $SCHEDULER_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
kubectl --namespace dragonfly-system port-forward $SCHEDULER_POD_NAME 8002:$SCHEDULER_CONTAINER_PORT
echo "Visit http://127.0.0.1:8002 to use your scheduler"

2. Get the dfdaemon port by running these commands:
export DFDAEMON_POD_NAME=$(kubectl get pods --namespace dragonfly-system -l "app=dragonfly,release=dragonfly,component=dfdaemon" -o jsonpath={.items[0].metadata.name})
export DFDAEMON_CONTAINER_PORT=$(kubectl get pod --namespace dragonfly-system $DFDAEMON_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
You can use $DFDAEMON_CONTAINER_PORT as a proxy port in Node.

3. Configure runtime to use dragonfly:
https://d7y.io/docs/getting-started/quick-start/kubernetes/

Check that dragonfly is deployed successfully:

$ kubectl get po -n dragonfly-system
NAME READY STATUS RESTARTS AGE
dragonfly-dfdaemon-rhnr6 1/1 Running 4 (101s ago) 3m27s
dragonfly-dfdaemon-s6sv5 1/1 Running 5 (111s ago) 3m27s
dragonfly-manager-67f97d7986-8dgn8 1/1 Running 0 3m27s
dragonfly-mysql-0 1/1 Running 0 3m27s
dragonfly-redis-master-0 1/1 Running 0 3m27s
dragonfly-redis-replicas-0 1/1 Running 1 (115s ago) 3m27s
dragonfly-redis-replicas-1 1/1 Running 0 95s
dragonfly-redis-replicas-2 1/1 Running 0 70s
dragonfly-scheduler-0 1/1 Running 0 3m27s
dragonfly-seed-peer-0 1/1 Running 2 (95s ago) 3m27s

Create the peer service configuration file peer-service-config.yaml; the content is as follows:

apiVersion: v1
kind: Service
metadata:
  name: peer
  namespace: dragonfly-system
spec:
  type: NodePort
  ports:
    - name: http-65001
      nodePort: 30950
      port: 65001
    - name: http-40901
      nodePort: 30951
      port: 40901
  selector:
    app: dragonfly
    component: dfdaemon
    release: dragonfly

Create a peer service using the configuration file:

kubectl apply -f peer-service-config.yaml

Install nydus for containerd

For detailed nydus installation documentation based on containerd environment, please refer to nydus-setup-for-containerd-environment. The example uses Systemd to manage the nydus-snapshotter service.

Install nydus tools

Download the containerd-nydus-grpc binary; please refer to nydus-snapshotter/releases:

NYDUS_SNAPSHOTTER_VERSION=0.3.3
wget https://github.com/containerd/nydus-snapshotter/releases/download/v$NYDUS_SNAPSHOTTER_VERSION/nydus-snapshotter-v$NYDUS_SNAPSHOTTER_VERSION-x86_64.tgz
tar zxvf nydus-snapshotter-v$NYDUS_SNAPSHOTTER_VERSION-x86_64.tgz

Install containerd-nydus-grpc tool:

sudo cp nydus-snapshotter/containerd-nydus-grpc /usr/local/bin/

Download the nydus-image, nydusd and nydusify binaries; please refer to dragonflyoss/image-service:

NYDUS_VERSION=2.1.1
wget https://github.com/dragonflyoss/image-service/releases/download/v$NYDUS_VERSION/nydus-static-v$NYDUS_VERSION-linux-amd64.tgz
tar zxvf nydus-static-v$NYDUS_VERSION-linux-amd64.tgz

Install nydus-image, nydusd and nydusify tools:

sudo cp nydus-static/nydus-image nydus-static/nydusd nydus-static/nydusify /usr/local/bin/

Install nydus snapshotter plugin for containerd

Configure containerd to use the nydus-snapshotter plugin; please refer to configure-and-start-containerd.

127.0.0.1:65001 is the proxy address of the dragonfly peer, and the X-Dragonfly-Registry header carries the address of the origin registry, from which dragonfly downloads the images.

Change configuration of containerd in /etc/containerd/config.toml:

[proxy_plugins]
  [proxy_plugins.nydus]
    type = "snapshot"
    address = "/run/containerd-nydus/containerd-nydus-grpc.sock"

[plugins.cri]
  [plugins.cri.containerd]
    snapshotter = "nydus"
    disable_snapshot_annotations = false
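The snippet above uses containerd's legacy (version 1) configuration layout. If your /etc/containerd/config.toml declares `version = 2` (common on containerd 1.5+), the CRI section is addressed by its full plugin ID instead; an equivalent sketch for that layout:

```toml
version = 2

[proxy_plugins]
  [proxy_plugins.nydus]
    type = "snapshot"
    address = "/run/containerd-nydus/containerd-nydus-grpc.sock"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "nydus"
  disable_snapshot_annotations = false
```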

Restart containerd service:

sudo systemctl restart containerd

Check that containerd uses the nydus-snapshotter plugin:

$ ctr -a /run/containerd/containerd.sock plugin ls | grep nydus
io.containerd.snapshotter.v1 nydus - ok

Start the nydus snapshotter service with systemd

For detailed configuration documentation based on nydus mirror mode, please refer to enable-mirrors-for-storage-backend.

Create the nydusd configuration file nydusd-config.json; the content is as follows:

{
  "device": {
    "backend": {
      "type": "registry",
      "config": {
        "mirrors": [
          {
            "host": "http://127.0.0.1:65001",
            "auth_through": false,
            "headers": {
              "X-Dragonfly-Registry": "https://index.docker.io"
            },
            "ping_url": "http://127.0.0.1:40901/server/ping"
          }
        ],
        "scheme": "https",
        "skip_verify": false,
        "timeout": 10,
        "connect_timeout": 10,
        "retry_limit": 2
      }
    },
    "cache": {
      "type": "blobcache",
      "config": {
        "work_dir": "/var/lib/nydus/cache/"
      }
    }
  },
  "mode": "direct",
  "digest_validate": false,
  "iostats_files": false,
  "enable_xattr": true,
  "fs_prefetch": {
    "enable": true,
    "threads_count": 10,
    "merging_size": 131072,
    "bandwidth_rate": 1048576
  }
}
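Hand-edited JSON breaks easily, and nydusd will not start on a malformed configuration. Before restarting the snapshotter, it can help to confirm the file parses and that the mirror points at the dragonfly peer proxy. A minimal sketch, with the config inlined and trimmed for illustration (point the script at your real /etc/nydus/config.json instead):

```shell
# Write a trimmed example config, then parse it and print the mirror host.
# In practice, read /etc/nydus/config.json instead of the example file.
cat > /tmp/nydusd-example.json <<'EOF'
{
  "device": {
    "backend": {
      "type": "registry",
      "config": {
        "mirrors": [
          { "host": "http://127.0.0.1:65001" }
        ]
      }
    }
  }
}
EOF
python3 -c '
import json
cfg = json.load(open("/tmp/nydusd-example.json"))
for mirror in cfg["device"]["backend"]["config"]["mirrors"]:
    print(mirror["host"])
'
# -> http://127.0.0.1:65001
```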

Copy configuration file to /etc/nydus/config.json:

sudo mkdir -p /etc/nydus && sudo cp nydusd-config.json /etc/nydus/config.json

Create the systemd unit file nydus-snapshotter.service for the nydus snapshotter; the content is as follows:

[Unit]
Description=nydus snapshotter
After=network.target
Before=containerd.service

[Service]
Type=simple
Environment=HOME=/root
ExecStart=/usr/local/bin/containerd-nydus-grpc --config-path /etc/nydus/config.json
Restart=always
RestartSec=1
KillMode=process
OOMScoreAdjust=-999
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

Copy configuration file to /etc/systemd/system/:

sudo cp nydus-snapshotter.service /etc/systemd/system/

Enable and start the nydus snapshotter service with systemd:

$ sudo systemctl enable nydus-snapshotter
$ sudo systemctl start nydus-snapshotter
$ sudo systemctl status nydus-snapshotter
● nydus-snapshotter.service - nydus snapshotter
Loaded: loaded (/etc/systemd/system/nydus-snapshotter.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2022-10-19 08:01:00 UTC; 2s ago
Main PID: 2853636 (containerd-nydu)
Tasks: 9 (limit: 37574)
Memory: 4.6M
CPU: 20ms
CGroup: /system.slice/nydus-snapshotter.service
└─2853636 /usr/local/bin/containerd-nydus-grpc --config-path /etc/nydus/config.json

Oct 19 08:01:00 kvm-gaius-0 systemd[1]: Started nydus snapshotter.
Oct 19 08:01:00 kvm-gaius-0 containerd-nydus-grpc[2853636]: time="2022-10-19T08:01:00.493700269Z" level=info msg="gc goroutine start..."
Oct 19 08:01:00 kvm-gaius-0 containerd-nydus-grpc[2853636]: time="2022-10-19T08:01:00.493947264Z" level=info msg="found 0 daemons running"

Convert an image to nydus format

Convert the python:3.9.15 image to nydus format. Alternatively, you can use the pre-converted dragonflyoss/python:3.9.15-nydus image and skip this step. The conversion can be performed with either nydusify or acceld.

Login to Dockerhub:

docker login

Convert the python:3.9.15 image to nydus format; the DOCKERHUB_REPO_NAME environment variable needs to be set to the user's image repository:

DOCKERHUB_REPO_NAME=dragonflyoss
sudo nydusify convert --nydus-image /usr/local/bin/nydus-image --source python:3.9.15 --target $DOCKERHUB_REPO_NAME/python:3.9.15-nydus

Try nydus with nerdctl

Running python:3.9.15-nydus with nerdctl:

sudo nerdctl --snapshotter nydus run --rm -it $DOCKERHUB_REPO_NAME/python:3.9.15-nydus

Check that nydus is downloaded via dragonfly based on mirror mode:

$ grep mirrors /var/lib/containerd-nydus/logs/**/*log
[2022-10-19 10:16:13.276548 +00:00] INFO [storage/src/backend/connection.rs:271] backend config: ConnectionConfig { proxy: ProxyConfig { url: "", ping_url: "", fallback: false, check_interval: 5, use_http: false }, mirrors: [MirrorConfig { host: "http://127.0.0.1:65001", headers: {"X-Dragonfly-Registry": "https://index.docker.io"}, auth_through: false }], skip_verify: false, timeout: 10, connect_timeout: 10, retry_limit: 2 }

Performance testing

Test the performance of single-machine image download after integrating nydus mirror mode with dragonfly P2P. Each test runs a version command in an image for a different language; for example, the startup command for the python image is python -V. All tests were performed on the same machine. Because the machine's own network environment affects absolute download times, what matters is not the absolute numbers but the relative change in download time across the different scenarios.

Figure: image download time comparison across the scenarios below (nydus-mirror-dragonfly)

  • OCIv1: Use containerd to pull image directly.
  • Nydus Cold Boot: Use containerd to pull image via nydus-snapshotter and doesn't hit any cache.
  • Nydus & Dragonfly Cold Boot: Use containerd to pull image via nydus-snapshotter. Transfer the traffic to dragonfly P2P based on nydus mirror mode and no cache hits.
  • Hit Dragonfly Remote Peer Cache: Use containerd to pull image via nydus-snapshotter. Transfer the traffic to dragonfly P2P based on nydus mirror mode and hit the remote peer cache.
  • Hit Dragonfly Local Peer Cache: Use containerd to pull image via nydus-snapshotter. Transfer the traffic to dragonfly P2P based on nydus mirror mode and hit the local peer cache.
  • Hit Nydus Cache: Use containerd to pull image via nydus-snapshotter. Transfer the traffic to dragonfly P2P based on nydus mirror mode and hit the nydus local cache.

The test results show the effect of integrating nydus mirror mode with dragonfly P2P. Compared with OCIv1 mode, downloading images in nydus format effectively reduces image download time, and the cold-boot times of nydus alone and nydus with dragonfly are essentially identical. Every scenario that hits a dragonfly cache performs better than nydus alone. Most importantly, when a very large kubernetes cluster pulls images with nydus, each image layer generates as many range requests as needed, which drives the QPS against the source registry very high. Dragonfly effectively reduces the number of back-to-source requests and the download traffic hitting the registry; in the best case, dragonfly downloads a given task back-to-source only once.

Dragonfly community

Nydus community