
16 posts tagged with "nydus-snapshotter"


Volcano Engine: distributed image acceleration practice based on Dragonfly

· 11 min read

CNCF projects highlighted in this post, and migrated by mingcheng.

Terms and definitions

  • OCI: The Open Container Initiative, a Linux Foundation project launched by Docker in June 2015 to design open standards for operating-system-level virtualization (most importantly Linux containers).
  • OCI Artifact: An artifact produced according to the OCI image spec.
  • Image: In this article, "image" refers to an OCI Artifact.
  • Image Distribution: A distribution service implemented according to the OCI distribution spec.
  • ECS: A collection of resources composed of CPU, memory, and cloud disks; each instance logically corresponds to a computing hardware entity in the data center infrastructure.
  • CR: The Volcano Engine image repository (image distribution) service.
  • VKE: Volcano Engine's managed Kubernetes service. It deeply integrates the new generation of cloud native technology to provide high-performance, container-centric Kubernetes cluster management, helping users build containerized applications quickly.
  • VCI: A serverless, containerized computing service on Volcano Engine. VCI seamlessly integrates with the container service VKE to provide Kubernetes orchestration capabilities. With VCI, you can focus on building the application itself without buying and managing the underlying infrastructure, and pay only for the resources the container actually consumes. VCI also supports second-level startup, highly concurrent creation, sandboxed container security isolation, and more.
  • TOS: Volcano Engine's massive, secure, low-cost, easy-to-use, highly reliable and highly available distributed cloud storage service.
  • Private Zone: A private DNS service based on a Virtual Private Cloud (VPC) environment, which maps private domain names to IP addresses in one or more custom VPCs.
  • P2P: Peer-to-peer technology. When a peer in a P2P network downloads data from the server, it can in turn serve that data to other peers. When a large number of nodes download concurrently, most of the data no longer needs to come from the server side, which reduces the pressure on the server.
  • Dragonfly: A file distribution and image acceleration system based on P2P technology, and the standard solution and best practice for image acceleration in cloud native architectures. It is now hosted by the Cloud Native Computing Foundation (CNCF) as an incubating project.
  • Nydus: The Nydus image acceleration framework implements a content-addressable filesystem that accelerates container startup by lazy loading. It supports the creation of millions of accelerated image containers daily and is deeply integrated with the Linux kernel's EROFS and fscache, enabling in-kernel support for image acceleration.

Background

Volcano Engine image repository CR uses TOS to store container images. Currently, it can meet the demand of large-scale concurrent image pulling to a certain extent. However, the final concurrency of pulling is limited by the bandwidth and QPS of TOS.

Here is a brief introduction of the two scenarios that are currently encountered for large-scale image pulling:

  1. The number of clients is increasing, and the images are getting larger. The bandwidth of TOS will eventually be insufficient.
  2. If the client uses Nydus to convert the image format, the request volume to TOS will increase by an order of magnitude. The QPS limit of TOS API makes it unable to meet the demand.

Whether it is the image repository service itself or the underlying storage, there are ultimately bandwidth and QPS limits. Relying solely on the bandwidth and QPS provided by the server side quickly becomes insufficient. Therefore, P2P needs to be introduced to reduce server pressure and meet the demand for large-scale concurrent image pulling.
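
To make the bandwidth pressure concrete, here is a rough back-of-the-envelope sketch in Go. The node count, image size, and registry bandwidth are assumed figures for illustration only, not Volcano Engine measurements.

```go
package main

import "fmt"

func main() {
	// Assumed figures for illustration only; they are not measured values.
	const (
		nodes        = 1000 // concurrent pulling nodes
		imageSizeGB  = 3.0  // e.g. a TensorFlow-sized image
		registryGbit = 10.0 // registry/TOS egress bandwidth in Gbit/s
	)

	totalGB := nodes * imageSizeGB   // data every node must receive, in GB
	registryGBps := registryGbit / 8 // Gbit/s -> GB/s

	// Without P2P, every byte is served by the registry.
	fmt.Printf("registry only: %.0f GB / %.2f GB/s = %.0f s at full line rate\n",
		totalGB, registryGBps, totalGB/registryGBps)

	// With P2P, ideally roughly one copy leaves the registry and the peers
	// exchange the rest among themselves.
	fmt.Printf("with P2P (ideal): ~%.0f GB of back-to-source traffic\n", imageSizeGB)
}
```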

Investigation of image distribution system based on P2P technology

There are several P2P projects in the open source community. Here is a brief introduction to these projects.

Dragonfly

Architecture

Diagram flow showing Dragonfly architecture

Manager

  • Stores dynamic configuration for consumption by the seed peer cluster, scheduler cluster and dfdaemon.
  • Maintains the relationship between seed peer clusters and scheduler clusters.
  • Provides async task management features for image preheating, combined with Harbor.
  • Keeps alive with scheduler instances and seed peer instances.
  • Filters the optimal scheduler cluster for dfdaemon.
  • Provides a visual console to help users manage the P2P cluster.

Scheduler

  • Selects the optimal parent peer based on a multi-feature intelligent scheduling system.
  • Builds a scheduling directed acyclic graph for the P2P cluster.
  • Removes abnormal peers based on multi-feature peer evaluation results.
  • Notifies the peer to fall back to source download when scheduling fails.

Dfdaemon

  • Serves gRPC for dfget with the downloading feature, and provides adaptation to different source protocols.
  • Can act as a seed peer. With seed peer mode enabled, it serves as the back-to-source download peer in a P2P cluster, i.e. the root peer for downloads in the entire cluster.
  • Serves as a proxy for container registry mirrors and any other HTTP backends.
  • Downloads objects via HTTP, HTTPS and other custom protocols.

Kraken

Architecture

Diagram flow showing Kraken architecture

Agent

  • A peer node in the P2P network; needs to be deployed on each node.
  • Implements the Docker registry interface.
  • Notifies the tracker of the data it owns.
  • Downloads data from other agents (the tracker tells the agent which agents to download the data from).

Origin

  • Responsible for reading data from storage for seeding.
  • Supports different storage backends.
  • Highly available in the form of a hash ring.

Tracker

  • The coordinator in the P2P network, tracking which nodes are peers and which are seeders.
  • Tracks the data owned by each peer.
  • Provides ordered lists of peer nodes for peers to download data from.
  • Highly available in the form of a hash ring.

Proxy

  • Implements the Docker registry interface.
  • Passes image layers to the Origin component.
  • Passes tags to the Build-Index component.

Build-Index

  • Maps tags to digests; when an agent downloads data for a given tag, it obtains the corresponding digest from Build-Index.
  • Replicates images between clusters.
  • Saves tag data in storage.
  • Highly available in the form of a hash ring.

Dragonfly vs Kraken

|                         | Dragonfly                                                 | Kraken                                                                   |
| ----------------------- | --------------------------------------------------------- | ------------------------------------------------------------------------ |
| High availability       | Scheduler consistent hash ring supports high availability  | Tracker consistent hash ring, multiple replicas ensure high availability  |
| Containerd support      | Support                                                    | Support                                                                   |
| HTTPS image repository  | Support                                                    | Support                                                                   |
| Community active level  | Active                                                     | Inactive                                                                  |
| Number of users         | More                                                       | Less                                                                      |
| Maturity                | High                                                       | High                                                                      |
| Optimized for Nydus     | Yes                                                        | No                                                                        |
| Architecture complexity | Middle                                                     | Middle                                                                    |

Summary

Based on the overall maturity of the project, community activity, number of users, architecture complexity, whether it is optimized for Nydus, future development trends and other factors, Dragonfly is the best choice among the P2P projects.

Proposal

For Volcano Engine, the main consideration is that VKE and VCI pull images through CR.

  • VKE is Kubernetes deployed on ECS, so it is well suited to deploying a dfdaemon on each node, making full use of each node's bandwidth and therefore of the P2P capability.
  • VCI runs on a small number of virtual nodes with abundant underlying resources, and the upper-layer services use pods as their carrier, so dfdaemon cannot be deployed on every node as with VKE. Instead, several dfdaemons are deployed as a cache layer, using their caching capability.
  • VKE or VCI clients pull images that have been converted to the Nydus format. In this scenario dfdaemon needs to act as a cache, and not too many nodes should be used, to avoid putting too much scheduling pressure on the Scheduler.

Based on the requirements of the above Volcano Engine products, combined with Dragonfly's characteristics, a deployment scheme that accommodates all of these factors was designed as follows.

Architecture

Diagram flow showing Volcano Engine architecture combined with Dragonfly's characteristics

  • Volcano Engine resources belong to primary accounts, and the P2P control components are isolated at the primary-account level, with one set of control components per primary account. A P2PManager controller is implemented on the server side, and the control plane of all P2P components is managed through this controller.
  • The P2P control components are deployed in the CR data-plane VPC and exposed to user clusters through an LB.
  • On a VKE cluster, Dfdaemon is deployed as a DaemonSet, with one Dfdaemon on each node.
  • On VCI, Dfdaemon is deployed as a Deployment.
  • Containerd on ECS accesses the Dfdaemon on its own node via 127.0.0.1:65001.
  • A controller component deployed in the user cluster uses the Private Zone function to generate the <clusterid>.p2p.volces.com domain name in the user cluster. The controller selects the Dfdaemon pods of specific nodes (covering both VKE and VCI) according to certain rules and resolves them to the above domain name as A records.
  • Nydusd on ECS accesses Dfdaemon via the <clusterid>.p2p.volces.com domain name.
  • The image service client and Nydusd on VCI access Dfdaemon via the <clusterid>.p2p.volces.com domain name.
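
The following Go sketch shows roughly what such a controller reconciliation loop could look like. The PodLister and DNSWriter interfaces, the health filter, and the in-memory fakes are hypothetical stand-ins for the Kubernetes and Private Zone APIs; this is not the actual Volcano Engine controller.

```go
package main

import (
	"fmt"
	"log"
)

// Endpoint is a Dfdaemon pod that can serve P2P traffic.
type Endpoint struct {
	Node    string
	IP      string
	Healthy bool
}

// PodLister and DNSWriter stand in for the Kubernetes API and the Private Zone
// API; both are hypothetical interfaces used only for this sketch.
type PodLister interface {
	ListDfdaemonPods() ([]Endpoint, error)
}

type DNSWriter interface {
	// SetARecords resolves the given domain to the given IPs as A records.
	SetARecords(domain string, ips []string) error
}

// reconcile picks healthy Dfdaemon pods (from VKE nodes or VCI) and publishes
// them behind the cluster-scoped P2P domain <clusterid>.p2p.volces.com.
func reconcile(clusterID string, pods PodLister, dns DNSWriter) error {
	eps, err := pods.ListDfdaemonPods()
	if err != nil {
		return err
	}
	var ips []string
	for _, ep := range eps {
		if ep.Healthy {
			ips = append(ips, ep.IP)
		}
	}
	return dns.SetARecords(fmt.Sprintf("%s.p2p.volces.com", clusterID), ips)
}

// In-memory fakes so the sketch runs end to end.
type fakePods []Endpoint

func (f fakePods) ListDfdaemonPods() ([]Endpoint, error) { return f, nil }

type fakeDNS struct{}

func (fakeDNS) SetARecords(domain string, ips []string) error {
	fmt.Printf("A %s -> %v\n", domain, ips)
	return nil
}

func main() {
	pods := fakePods{
		{Node: "node-1", IP: "10.0.0.11", Healthy: true},
		{Node: "node-2", IP: "10.0.0.12", Healthy: false},
	}
	if err := reconcile("cluster-123", pods, fakeDNS{}); err != nil {
		log.Fatal(err)
	}
}
```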

Benchmark

Environment

Container Registry: bandwidth 10 Gbit/s

Dragonfly Scheduler: 2 replicas, request 1C2G, limit 4C8G, bandwidth 6 Gbit/s

Dragonfly Manager: 2 replicas, request 1C2G, limit 4C8G, bandwidth 6 Gbit/s

Dragonfly Peer: limit 2C6G, bandwidth 6 Gbit/s, SSD

Image

Nginx(500M)

TensorFlow(3G)

Component Version

Dragonfly v2.0.8

POD Creation to Container Start

Time from pod creation to container start when 50, 100, 200, and 500 Nginx pods are created concurrently:

Bar chart showing Nginx pod creation-to-container-start time for 50, 100, 200 and 500 pods under OCI v1, Dragonfly, and Dragonfly & Nydus

Time from pod creation to container start when 50, 100, 200, and 500 TensorFlow pods are created concurrently:

Bar chart showing TensorFlow pod creation-to-container-start time for 50, 100, 200 and 500 pods under OCI v1, Dragonfly, and Dragonfly & Nydus

In large-scale image scenarios, the Dragonfly and Dragonfly & Nydus setups save more than 90% of container startup time compared with the OCI v1 setup. The even shorter startup time with Nydus comes from its lazy-loading feature: only a small part of the image (the metadata) needs to be pulled for the pod to start.

Back-to-source Peak Bandwidth on Container Registry

Peak back-to-source bandwidth on the registry when 50, 100, 200, and 500 Nginx pods are created concurrently:

Bar Chart showing impact of Nginx on Container Registry divided in 50 Pods, 100 Pods, 200 Pods and 500 Pods in OCI v1, Dragonfly

Peak back-to-source bandwidth on the registry when 50, 100, 200, and 500 TensorFlow pods are created concurrently:

Bar Chart showing impact of TensorFlow on Container Registry divided in 50 Pods, 100 Pods, 200 Pods and 500 Pods in OCI v1, Dragonfly

Back-to-source Traffic on Container Registry

Back-to-source traffic on the registry when 50, 100, 200, and 500 Nginx pods are created concurrently:

Bar Chart showing impact of Nginx on Container Registry divided in 50 Pods, 100 Pods, 200 Pods and 500 Pods in OCI v1, Dragonfly

Back-to-source traffic on the registry when 50, 100, 200, and 500 TensorFlow pods are created concurrently:

Bar Chart showing impact of TensorFlow on Container Registry divided in 50 Pods, 100 Pods, 200 Pods and 500 Pods in OCI v1, Dragonfly

In large-scale scenarios, only a small amount of image data goes back to source with Dragonfly, whereas in the OCI v1 scenario every image must be pulled back to source, so both the back-to-source peak bandwidth and the total back-to-source traffic with Dragonfly are far lower than with OCI v1. Moreover, with Dragonfly, the back-to-source peak and traffic do not increase significantly as concurrency grows.

Reference

Volcano Engine https://www.volcengine.com/

Volcano Engine VKE https://www.volcengine.com/product/vke

Volcano Engine CR https://www.volcengine.com/product/cr

Dragonfly https://d7y.io/

Dragonfly Github Repo https://github.com/dragonflyoss/dragonfly

Nydus https://nydus.dev/

Nydus GitHub Repo https://github.com/dragonflyoss/image-service

Dragonfly v2.0.9 is released

· 5 min read

CNCF projects highlighted in this post, and migrated by mingcheng.

Project post originally published on Github by Dragonfly maintainers

Dragonfly provides efficient, stable and secure file distribution and image acceleration based on P2P technology, and aims to be the best practice and standard solution in cloud native architectures. It is hosted by the Cloud Native Computing Foundation (CNCF) as an Incubating Level Project.

Dragonfly v2.0.9 is released! 🎉🎉🎉 Thanks to the Google Cloud Platform (GCP) Team, Volcano Engine Team, and Baidu AI Cloud Team for helping Dragonfly integrate with their public clouds. Welcome to visit d7y.io website.


Features

  • Download tasks based on priority. Priority can be passed as parameter during the download task, or can be associated with priority in the application of the Manager console, refer to priority protoc definition.
  • Scheduler adds PieceDownloadTimeout parameter, which indicates that if the piece download times out, the scheduler will change the task state to TaskStateFailed.
  • Add health service to each GRPC service.
  • Add reflection to each GRPC service.
  • Manager supports the Redis Sentinel model.
  • Refactor dynconfig package to remove json.Unmarshal, improving its runtime efficiency.
  • Fix panic caused by hashring not being built.
  • Previously, most pieces were downloaded from the same parent. Now, different pieces are downloaded from different parents to improve download efficiency and distribute bandwidth among multiple parents (see the sketch after this list).
  • If the Manager's searcher cannot find candidate scheduler clusters, it returns all clusters for peers to health-check. A scheduler cluster that passes the health check can then be used.
  • Support ORAS source client to pull image.
  • Add UDP ping package and GRPC protoc definition for building virtual network topology.
  • The V2 P2P protocol has been added, and both Scheduler and Manager have implemented the API of the V2 P2P protocol, in preparation for the future Rust version of Dfdaemon.
  • OSS source client supports STS access; users can set a security token in the header.
  • Dynconfig supports resolving addresses with the health service.
  • Add hostTTL and hostGCInterval in Scheduler to prevent information of abnormally exited Dfdaemon from becoming dirty data in the Scheduler.
  • Add CIDR to searcher to provide more precise scheduler cluster selection for Dfdaemon.
  • Refactor the metric definitions for the V1 P2P protocol and add the metric definitions for the V2 P2P protocol. Additionally, reorganize the Dragonfly Grafana Dashboards, refer to monitoring.
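
As a toy illustration of spreading pieces across multiple parents (referenced in the list above), here is a round-robin assignment sketch in Go. Dragonfly's real scheduler uses multi-feature scoring rather than plain round-robin, and the peer names are made up.

```go
package main

import "fmt"

// assignPieces spreads piece indexes across the candidate parents in a simple
// round-robin fashion, so no single parent serves all pieces.
func assignPieces(pieces int, parents []string) map[string][]int {
	plan := make(map[string][]int, len(parents))
	for i := 0; i < pieces; i++ {
		p := parents[i%len(parents)]
		plan[p] = append(plan[p], i)
	}
	return plan
}

func main() {
	parents := []string{"peer-a", "peer-b", "peer-c"} // assumed parent peers
	for parent, pieces := range assignPieces(10, parents) {
		fmt.Println(parent, pieces)
	}
}
```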

Breaking Change

  • Using the default value for the key used to generate JWT tokens in Manager can lead to security issues. Therefore, Manager has added JWT Key in the configuration, and upgrading Manager requires generating a new JWT Key and setting it in the Manager configuration.

Public Cloud Providers

Others

You can see CHANGELOG for more details.

The Evolution of the Nydus Image Acceleration

· 14 min read
Jingbo Xu

The Evolution of the Nydus Image Acceleration

Optimized container images together with technologies such as P2P networks can effectively speed up the process of container deployment and startup. In order to achieve this, we developed the Nydus image acceleration service (also a sub-project of CNCF Dragonfly).

In addition to startup speed, core features such as image layering and lazy pulling are also particularly important in the field of container images. But since no native filesystem supports this, most projects opt for userspace solutions, and Nydus initially did the same. However, userspace solutions are encountering more and more challenges nowadays, such as a large performance gap compared with native filesystems and noticeable resource overhead in high-density deployment scenarios.

Therefore, we designed and implemented the RAFS v6 format which is compatible with the in-kernel EROFS filesystem, hoping to form a content-addressable in-kernel filesystem for container images. After the lazy-pulling technology of "EROFS over Fscache" was merged into 5.19 kernel, the next-generation architecture of Nydus is gradually becoming clear. This is the first native in-kernel solution for container images, promoting a high-density, high-performance and high-availability solution for container images.

This article will introduce the evolution of Nydus from three perspectives: Nydus architecture outline, RAFS v6 image format and "EROFS over Fscache" on-demand loading technology.

Please refer to Nydus for more details of this project. Now you can experience all these new features with this user guide.

Nydus Architecture Outline

In brief, Nydus is a filesystem-based image acceleration service that designs the RAFS (Registry Acceleration File System) disk format, optimizing the startup performance of OCIv1 container images.

The fundamental idea of the container image is to provide the root directory (rootfs) of the container, which can be carried by the filesystem or the archive format. Besides, it can also be implemented together with a custom block format, but anyway it needs to present as a directory tree, providing the file interface to containers.

Let's take a look at the OCIv1 standard image format first. The OCIv1 format is an image format specification based on the Docker Image Manifest Version 2 Schema 2. It consists of a manifest, an image index (optional), a series of container image layers and configuration files. Essentially, OCIv1 is a layer-based image format, with each layer storing file-level diff data in tgz archive format. ociv1

Due to the limitation of tgz, OCIv1 has some inherent issues, such as inability to load on demand, coarser level deduplication granularity, unstable hash digest for each layer, etc.

As for the custom block format, it also has some flaws by design.

  • Since the container image should be eventually presented as a directory tree, a filesystem (such as ext4) is needed upon that. In this case the dependency chain is "custom block format + userspace block device + filesystem", which is obviously more complex compared to the native filesystem solution;
  • Since the block format is not aware of the upper filesystem, it would be impossible to distinguish the metadata and data of the filesystem and process them separately (such as compression);
  • Similarly, it is unable to implement file-based image analysis features such as security scanning, hotspot analysis, and runtime interception, etc.;
  • Unable to directly merge multiple existing images into one large image without modifying the blob image, which is the natural ability of the filesystem solution.

Therefore, Nydus is a filesystem-based container image acceleration solution. It introduces the RAFS image format, in which the data (blobs) and metadata (bootstrap) of the image are separated, and each image layer only stores the data part. Files within the image are divided into chunks for deduplication, with each image layer storing the corresponding chunk data. This allows chunk-level deduplication between layers and images, and it also helps implement on-demand loading. Since the metadata is separated from the data and combined into one place, accessing the metadata does not require pulling the corresponding data, which speeds up file access considerably.
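
A minimal Go sketch of the chunk-level deduplication idea: files are split into fixed-size chunks, each chunk is addressed by its digest, and identical chunks across layers or images collapse into a single stored blob. The 1 MiB chunk size and SHA-256 digest are assumptions for illustration, not the RAFS on-disk format.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

const chunkSize = 1 << 20 // assumed 1 MiB chunks, for illustration only

// chunkDigests splits file content into fixed-size chunks and returns the
// content-addressed digest of each chunk.
func chunkDigests(data []byte) []string {
	var digests []string
	for off := 0; off < len(data); off += chunkSize {
		end := off + chunkSize
		if end > len(data) {
			end = len(data)
		}
		sum := sha256.Sum256(data[off:end])
		digests = append(digests, hex.EncodeToString(sum[:]))
	}
	return digests
}

func main() {
	// Two "files" with identical content, as if the same file appeared in two
	// layers or two images: their chunks deduplicate to the same blob entries.
	fileA := make([]byte, 3*chunkSize)
	for i := range fileA {
		fileA[i] = byte(i / chunkSize) // make each chunk's content distinct
	}
	fileB := append([]byte(nil), fileA...)

	blobs := map[string]bool{} // unique chunks actually stored/transferred
	for _, f := range [][]byte{fileA, fileB} {
		for _, d := range chunkDigests(f) {
			blobs[d] = true
		}
	}
	fmt.Printf("chunks referenced: %d, unique chunks stored: %d\n", 6, len(blobs))
}
```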

The Nydus RAFS image format is shown below: nydus_rafs

RAFS v6 image format

Evolution of RAFS image format

Prior to the introduction of the RAFS v6 format, Nydus used a fully userspace-implemented image format, working via FUSE or virtiofs. However, the userspace filesystem has the following defects:

  • The overhead of a large number of system calls cannot be ignored, especially in the case of random small I/Os with queue depth 1;
  • Frequent file operations generate a large number of FUSE requests, resulting in frequent kernel/user mode context switches, which then become the performance bottleneck;
  • In non-FSDAX scenarios, the buffer copy from user to kernel mode will consume CPUs;
  • In the FSDAX (via virtiofs) scenario, a large number of small files will occupy considerable DAX window resources, resulting in potential performance jitter; frequent switching between small files will also generate noticeable DAX mapping setup overhead.

Essentially these problems are caused by the natural limitations of the userspace filesystem solution, and if the container filesystem is an in-kernel filesystem, the problems above can be resolved in practice. Therefore, we introduced RAFS v6 image format, a container image format implemented in kernel based on EROFS filesystem.

Introduction to EROFS filesystem

The EROFS filesystem has been in the Linux mainline since Linux 4.19. In the past it was mainly used on mobile devices. It is available in the current major distributions (such as Fedora, Ubuntu, Arch Linux, Debian, Gentoo, etc.). The userspace tool erofs-utils is also already available in these distributions and in the OIN Linux System Definition List, and the community is quite active.

EROFS filesystem has the following characteristics:

  • Native local read-only block-based filesystem suitable for various scenarios, the disk format has the minimum I/O unit definition;
  • Page-sized block-aligned uncompressed metadata;
  • Effective space saving through Tail-packing inline technology while keeping high performance;
  • Data is addressed in blocks (mmap I/O friendly, no post I/O processing required);
  • Disk directory format friendly for random access;
  • Simple on-disk format, easy to increase the payload, better scalability;
  • Support DIRECT I/O access; support block devices, FSDAX and other backends;
  • A boot sector is reserved, which can help bootstrap and other requirements.

Introduction to RAFS v6 image format

Over the past year, the Alibaba Cloud kernel team has made several improvements and enhancements to EROFS filesystem, adapting it to the container image storage scenarios, and finally presenting it as a container image format implemented on the kernel side, RAFS v6. In addition, RAFS v6 also carries out a series of optimizations on the image format, such as block alignment, more compact metadata, and more.

The new RAFS v6 image format is as follows: rafsv6

The improved Nydus image service architecture is illustrated as below, adding support for the (EROFS-based) RAFS v6 image format: rafsv6_arch

EROFS over Fscache

erofs over fscache is the next-generation container image on-demand loading technology developed by the Alibaba Cloud kernel team for Nydus. It is also the native image on-demand loading feature of the Linux kernel. It was integrated into the Linux kernel mainline 5.19. erofs_over_fscache_merge

And on LWN.net as a highlighting feature of the 5.19 merge window: erofs_over_fscache_lwn

Prior to this, almost all lazy pulling solutions available were in the user mode. The userspace solution involves frequent kernel/user mode context switching and memory copying between kernel/user mode, resulting in performance bottlenecks. This problem is especially prominent when all the container images have been downloaded locally, in which case the file access will still switch to userspace.

In order to avoid the unnecessary overhead, we can decouple the two operations of 1) cache management of image data and 2) fetching data through various sources (such as network) on cache miss. Cache management can be all in the kernel mode, so that it can avoid kernel/user mode context switching when the image is locally ready. This is exactly the main benefit of erofs over fscache technology.

Brief Introduction

fscache/cachefiles (hereinafter collectively referred to as fscache) is a relatively mature file caching solution in Linux systems, and is widely used in network filesystems (such as NFS, Ceph, etc.). Our attempt is to make it work with the on-demand loading for local filesystems such as EROFS.

In this case, when the container accesses the container image, fscache will check whether the requested data has been cached. On cache hit, the data will be read directly from the cache file. It is processed directly in kernel, and will not switch to userspace. erofs_over_fscache_cache_hit

Otherwise (cache miss), the userspace service Nydusd is notified to process the request while the container process sleeps on it; Nydusd fetches the data from remote, writes it to the cache file through fscache, and wakes the sleeping process. Once awakened, the process can read the data from the cache file. erofs_over_fscache_cache_miss
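
The hit/miss flow can be pictured with a small userland Go analogy (it is only an analogy; the real path runs through the in-kernel fscache and the nydusd daemon, and the fetch function here is a fabricated stand-in):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// fetchFromRemote stands in for Nydusd pulling a chunk from the registry or
// P2P network on a cache miss; the content here is fabricated for the sketch.
func fetchFromRemote(id string) []byte {
	return []byte("remote data for " + id)
}

// readChunk mimics the flow described above: serve from the cache file on a
// hit, otherwise fetch, populate the cache, then serve.
func readChunk(cacheDir, id string) ([]byte, error) {
	path := filepath.Join(cacheDir, id)
	if data, err := os.ReadFile(path); err == nil {
		return data, nil // cache hit: no userspace daemon involved in the real design
	}
	data := fetchFromRemote(id) // cache miss: the userspace daemon is woken up
	if err := os.WriteFile(path, data, 0o644); err != nil {
		return nil, err
	}
	return data, nil
}

func main() {
	dir, err := os.MkdirTemp("", "cache")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	for i := 0; i < 2; i++ { // first read misses, second read hits
		data, err := readChunk(dir, "chunk-0001")
		if err != nil {
			panic(err)
		}
		fmt.Println(len(data), "bytes")
	}
}
```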

Advantages of the Solution

As described above, when the image has been downloaded locally, userspace solutions still need to switch to userspace on access, and the memory-copy overhead between kernel and user mode is also involved. With erofs over fscache, there is no switch to userspace at all, so on-demand loading is truly "on-demand". In other words, it delivers native performance and stability once images are available locally. In brief, it implements a real one-stop and lossless solution for the two scenarios of 1) on-demand loading and 2) downloading container images in advance.

Specifically, erofs over fscache has the following advantages over userspace solutions.

1. Asynchronous prefetch

After the container is created, Nydusd can start downloading the image even before on-demand loading (a cache miss) is triggered. Nydusd downloads the data and writes it to the cache file. Then, when a specific file range is accessed, EROFS reads directly from the cache file without switching to userspace, whereas other userspace solutions still have to make the round trip to userspace. erofs_over_fscache_prefetch

2. Network IO optimization

When on-demand loading (cache miss) is triggered, Nydusd can download more data at one time than requested. For example, when a 4KB I/O is requested, Nydusd can actually download 1MB of data at a time to reduce the network transmission delay per unit of file size. Then, when the container accesses the remaining data within this 1MB, it no longer switches to userspace. Userspace solutions cannot do this: they still need to switch to userspace on data access even within the prefetched range. erofs_over_fscache_readahead
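
The read-amplification idea boils down to aligning a small requested range to a larger fetch granularity, sketched below in Go with an assumed 1 MiB window:

```go
package main

import "fmt"

const readahead = 1 << 20 // assumed 1 MiB fetch granularity, as in the example above

// expand widens a small requested range to readahead-aligned boundaries, so a
// 4 KiB read triggers one larger fetch and later reads in that window hit the cache.
func expand(off, length int64) (start, end int64) {
	start = off / readahead * readahead
	end = (off + length + readahead - 1) / readahead * readahead
	return start, end
}

func main() {
	start, end := expand(5*1<<20+8192, 4096) // a 4 KiB read at an ~5 MiB offset
	fmt.Printf("requested 4 KiB, fetching [%d, %d) = %d KiB\n",
		start, end, (end-start)/1024)
}
```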

3. Better performance

When images have been downloaded locally (the impact of on-demand loading is not considered in this case), erofs over fscache performs significantly better than userspace solutions, while achieving performance similar to the native filesystem. Here are the performance statistics under several workloads [1].

read/randread IO

The following are the performance statistics of file read/randread buffered IO [2]:

| read        | IOPS | BW       | Performance |
| ----------- | ---- | -------- | ----------- |
| native ext4 | 267K | 1093MB/s | 1           |
| loop        | 240K | 982MB/s  | 0.90        |
| fscache     | 227K | 931MB/s  | 0.85        |
| fuse        | 191K | 764MB/s  | 0.70        |

| randread    | IOPS  | BW       | Performance |
| ----------- | ----- | -------- | ----------- |
| native ext4 | 10.1K | 41.2MB/s | 1           |
| loop        | 8.7K  | 34.8MB/s | 0.84        |
| fscache     | 9.5K  | 38.2MB/s | 0.93        |
| fuse        | 7.6K  | 31.2MB/s | 0.76        |
  • "native" means that the test file is directly on the local ext4 filesystem
  • "loop" means that the test file is inside a erofs image, while the erofs image is mounted through the DIRECT IO mode of the loop device
  • "fscache" means that the test file is inside a erofs image, while the erofs image is mounted through the erofs over fscache scheme
  • "fuse" means that the test file is in the fuse filesystem [3]
  • The "Performance" column normalizes the performance statistics of each mode, based on the performance of the native ext4 filesystem

It can be seen that the read/randread performance in fscache mode is basically the same as that in loop mode, and better than that in fuse mode; however, there is still a certain gap with the native ext4 filesystem. We are analyzing and optimizing it further; in theory it can achieve essentially lossless performance relative to the native filesystem.

File metadata manipulation

Test the performance of file metadata operations by performing a tar operation [4] on a large number of small files.

|             | Time   | Performance |
| ----------- | ------ | ----------- |
| native ext4 | 1.04s  | 1           |
| loop        | 0.550s | 1.89        |
| fscache     | 0.570s | 1.82        |
| fuse        | 3.2s   | 0.33        |

It can be seen that the erofs format performs even better than the native ext4 filesystem here, which results from erofs's optimized filesystem format. Since erofs is a read-only filesystem, all of its metadata can be arranged compactly, whereas ext4 is a writable filesystem whose metadata is scattered among multiple block groups (BGs).

Typical workload

Test the performance of linux source code compilation [5] as the typical workload.

| Linux compiling | Time | Performance |
| --------------- | ---- | ----------- |
| native ext4     | 156s | 1           |
| loop            | 154s | 1.0         |
| fscache         | 156s | 1.0         |
| fuse            | 200s | 0.78        |

It can be seen that fscache mode is basically the same as that of loop mode and native ext4 filesystem, and is better than fuse mode.

4. High-density deployment

Since the erofs over fscache technology is implemented based on files, i.e. each container image is represented as a cache file under fscache, it naturally supports high-density deployment scenarios. For example, a typical node.js container image corresponds to ~20 cache files under this scheme, then in a machine with hundreds of containers deployed, only thousands of cache files need to be maintained.

5. Failover and Hot Upgrade

When all the image files have been downloaded locally, file access no longer requires the intervention of the user-mode service process, so the user-mode service process has a much larger time window in which to implement failure recovery and hot upgrade. User-mode processes are not even required in this scenario, which improves the stability of the solution.

6. A one-stop solution for container images

With RAFS v6 image format and erofs over fscache on-demand loading technology, Nydus is suitable for both runc and Kata as a one-stop solution for container image distribution in these two scenarios.

More importantly, erofs over fscache is truly a one-stop and lossless solution for the two scenarios of 1) on-demand loading and 2) downloading container images in advance. On the one hand, with on-demand loading implemented, it can significantly speed up container startup, since it does not need to download the complete container image locally. On the other hand, it is compatible with the scenario where the container image has already been downloaded locally; it no longer switches to userspace in this case, achieving almost lossless performance and stability compared with the native filesystem.

The Future

Going forward, we will keep improving the erofs over fscache technology, with work such as finer-grained image deduplication among containers, stargz support, FSDAX support, and performance optimization.

Last but not least, I would like to thank all the individuals and teams who have supported and helped us during the development of the project, and specially thanks to ByteDance and Kuaishou folks for their solid support. Let us work together to build a better container image ecosystem :)

  1. Test environment: ecs.i2ne.4xlarge (16 vCPU, 128 GiB Mem, local NVMe disk)
  2. Test command "fio -ioengine=psync -bs=4k -direct=0 -rw=[read|randread] -numjobs=1"
  3. Use passthrough_hp as fuse daemon
  4. Test the execution time of "tar -cf /dev/null linux_src_dir" command
  5. Test the execution time of the "time make -j16" command

Containerd Accepted Nydus-snapshotter

· 4 min read
Changwei Ge

Containerd Accepted Nydus-snapshotter

In early January, the containerd community accepted nydus-snapshotter as a sub-project. Check out the code, detailed introduction and tutorial in its new repository. We believe that the donation to containerd will attract more users and developers to nydus itself and bring much value to the community.

Nydus-snapshotter is a remote snapshotter for containerd. It works as a standalone process outside of containerd, pulling only the nydus image's bootstrap from the remote registry and forking another process called nydusd. Nydusd has a unified architecture: it can work as a FUSE userspace filesystem daemon, a virtio-fs daemon, or an fscache userspace daemon. Nydusd is responsible for fetching data blocks from remote storage, such as object storage or a standard image registry, to fulfill the containers' requests to read their rootfs.

Nydus is an excellent container image acceleration solution that significantly reduces container startup time. It was originally developed by a virtual team from Alibaba Cloud and Ant Group and is deployed at very large scale: millions of containers are created from nydus images each day at Alibaba Cloud and Ant Group. The underlying technique is a newly designed, container-optimized and container-oriented read-only filesystem named RAFS. Several approaches are provided to create RAFS-format container images. The image can be pushed to and stored in a standard registry since it is compatible with the OCI image and distribution specifications. A nydus image can be converted from an OCI source image, where metadata and file data are split into a "bootstrap" and one or more "blobs", together with the necessary manifest.json and config.json. Development of the integration with BuildKit is in progress.

rafs disk layout

Nydus provides following key features:

  • Chunk-level data de-duplication among layers in a single repository to reduce storage, transport and memory cost
  • Deleted (whiteout) files in a certain layer aren't packed into the nydus image, so the image size may be reduced
  • E2E image data integrity check, so security issues like "supply chain attacks" can be detected and avoided at runtime
  • Integration with the CNCF incubating project Dragonfly to distribute container images in a P2P fashion and mitigate the pressure on container registries
  • Support for different container image storage backends, for example Registry, NAS, Aliyun/OSS; applying another remote storage backend like AWS S3 is also possible
  • Recording of file access patterns during runtime, gathering an access trace/log with which users' abnormal behaviors are easily caught, so we can ensure the image can be trusted

Beyond the essential features above, nydus can be flexibly configured as a FUSE-based userspace filesystem or as in-kernel EROFS with a userspace on-demand loader daemon, and integrating nydus with VM-based container runtimes is much easier.

  • Lightweight integration with VM-based container runtimes like Kata Containers. In fact, Kata Containers is considering supporting nydus as a native image acceleration solution.
  • Nydus closely cooperates with the Linux in-kernel disk filesystem EROFS: containers' rootfs can be set up directly by EROFS with lazy-pulling capability. The corresponding changes have been in the Linux kernel since v5.16.

To run with runc, nydusd works as FUSE user-space daemon:

runc nydus

To work with KataContainers, it works as a virtio-fs daemon:

kata nydus

The nydus community is working together with the Linux kernel community to develop erofs + fscache based userspace on-demand read.

runc erofs nydus

Nydus and eStargz developers are working together on a new project named acceld in the Harbor community, to provide a general service that converts OCI v1 images into various acceleration image formats for different accelerator providers, keeping the upgrade from OCI v1 images smooth. In addition to the conversion service acceld and the conversion tool nydusify, nydus also supports buildkit so that nydus images can be exported directly from a Dockerfile as a compression type.

In the future, the nydus community will work closely with the containerd community on fast and efficient methods and solutions for distributing container images, container image security, container image content storage efficiency, and more.

Introducing Nydus – Dragonfly Container Image Service

· 8 min read

Guest post by Pengtao and Liubo, Software Engineers at Ant Group

Tao is a software engineer at Ant Group. He has been working on Linux filesystem development for more than 10 years. He is also a core maintainer of the Kata Containers project. In recent years, Tao has mainly worked on container runtimes and services. He is a strong believer in and advocate for open source and cloud native technology.

Bo Liu has been an active contributor to the Linux kernel since 2009, mostly working on the Btrfs filesystem. He now works at Alibaba Group, and his main interests are Linux filesystems and container technologies.

Small is Fast, Large is Slow

With containers, it is relatively fast to deploy web apps, mobile backends, and API services right out of the box. Why? Because the container images they use are generally small (hundreds of MB).

A larger challenge is deploying applications with a huge container image (several GB). It takes a good amount of time to have these images ready to use. We want the time spent shortened to a certain extent to leverage the powerful container abstractions to run and scale the applications fast.

Dragonfly has been doing well at distributing container images. However, users still have to download an entire container image before creating a new container.

Another big challenge is arising security concerns about container image.

Conceptually, we pack an application's environment into a single image that is easily shared with consumers. The image is then put into a filesystem locally, on top of which the application can run. The pieces that are now being launched as nydus are the culmination of years of work and experience of our team in building filesystems.

Here we introduce the Dragonfly image service, called nydus, as an extension to the Dragonfly project. It is software that minimizes download time and provides image integrity checks across the whole lifetime of a container, enabling users to manage applications fast and safely.

nydus is co-developed by engineers from Alibaba Cloud and Ant Group. It is widely used in the internal production deployments. From our experience, we value its container creation speedup and image isolation enhancement the most. And we are seeing interesting use cases of it from time to time.

Nydus: Dragonfly Image Service

The nydus project designs and implements a userspace filesystem on top of a container image format that improves over the current OCI image specification. Its key features include:

  • Container images are downloaded on demand
  • Chunk-level data de-duplication
  • Flatten image metadata and data to remove all intermediate layers
  • Only usable image data is saved when building a container image
  • Only usable image data is downloaded when running a container
  • End-to-end image data integrity
  • Compatible with the OCI artifacts spec and distribution spec
  • Integrated with existing CNCF project dragonfly to support image distribution in large clusters
  • Different container image storage backends are supported

Nydus mainly consists of a new container image format and a FUSE (Filesystem in USErspace) daemon that translates it into a container-accessible mountpoint.

Nydus architecture

The FUSE daemon takes either the FUSE or virtiofs protocol to serve pods created by conventional runc containers or Kata Containers. It supports pulling container image data from a container image registry, OSS, or NAS, as well as from Dragonfly supernodes and node peers. It can also optionally use a local directory to cache all container image data to speed up future container creation.

Internally, nydus splits a container image into two parts: a metadata layer and a data layer. The metadata layer is a self-verifiable merkle tree. Each file and directory is a node in the merkle tree with a hash alongside it. A file's hash is the hash of its file content, and a directory's hash is the hash of all of its descendants. Each file is divided into even-sized chunks that are saved in the data layer. File chunks can be shared among different container images by letting file nodes in different images point to the same chunk locations in the shared data layer.
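
A simplified Go sketch of the merkle-tree idea described above: a file's hash covers its content, a directory's hash covers its (sorted) children, so any modification propagates up to the root. Real RAFS metadata records far more than this toy Node type.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// Node is a simplified file or directory in the metadata layer; real nydus
// metadata carries much more (chunk indexes, attributes, and so on).
type Node struct {
	Name     string
	Data     []byte  // file content (empty for directories)
	Children []*Node // directory entries
}

// digest computes a merkle-style hash: a file's hash covers its content, a
// directory's hash covers the sorted hashes of its children.
func digest(n *Node) string {
	h := sha256.New()
	if len(n.Children) == 0 {
		h.Write(n.Data)
	} else {
		sort.Slice(n.Children, func(i, j int) bool { return n.Children[i].Name < n.Children[j].Name })
		for _, c := range n.Children {
			h.Write([]byte(c.Name))
			h.Write([]byte(digest(c)))
		}
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	root := &Node{Name: "/", Children: []*Node{
		{Name: "etc", Children: []*Node{{Name: "hostname", Data: []byte("web-1\n")}}},
		{Name: "app.bin", Data: []byte{0x7f, 'E', 'L', 'F'}},
	}}
	fmt.Println("root digest:", digest(root))
	// Any change to a file changes its hash and, in turn, every ancestor's
	// hash up to the root, which is what makes the tree self-verifiable.
}
```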

Nydus architecture

How can you benefit from nydus?

The immediate benefit of running nydus image service is that users can launch containers almost instantly. In our tests, we found out that nydus can boost container creation from minutes to seconds.

Nydus performance chart

Another less obvious but important benefit is the runtime data integrity check. With OCIv1 container images, the image data cannot be verified after being unpacked to a local directory, which means that if some files in the local directories are tampered with, intentionally or not, containers simply take them as-is, incurring a data-leak risk. In contrast, a nydus image is not unpacked to a local directory at all; moreover, since verification can be enforced on every access to nydus image data, the risk can be avoided entirely by re-fetching the data from the trusted image registry when verification fails.
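
The verify-on-access behaviour can be sketched in a few lines of Go: compare the chunk digest recorded in the trusted metadata with the data actually read, and refetch from the registry on a mismatch. The helper names and data here are fabricated for illustration; this is not nydus source code.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"errors"
	"fmt"
)

// fetchTrusted stands in for re-pulling a chunk from the trusted registry; the
// data returned here is fabricated for the sketch.
func fetchTrusted(id string) []byte { return []byte("good chunk " + id) }

// readVerified returns the cached chunk only if its digest matches the one
// recorded in the image metadata; otherwise it refetches from the registry.
func readVerified(id string, cached []byte, want [32]byte) ([]byte, error) {
	if sha256.Sum256(cached) == want {
		return cached, nil
	}
	refetched := fetchTrusted(id)
	if sha256.Sum256(refetched) != want {
		return nil, errors.New("chunk " + id + " failed verification even from registry")
	}
	return refetched, nil
}

func main() {
	good := fetchTrusted("0001")
	want := sha256.Sum256(good)

	tampered := bytes.ToUpper(good) // simulate local corruption or tampering
	data, err := readVerified("0001", tampered, want)
	fmt.Println(string(data), err)
}
```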

Nydus architecture

The Future of Nydus

The above examples showcase the power of nydus. For the last year, we’ve worked alongside the production team, laser-focused on making nydus stable, secure, easy to use.

Now, as the foundation for nydus has been laid, our new focus is the ecosystem it aims to serve broadly. We envision a future where users install dragonfly and nydus on their clusters, run containers with large images as fast as they do with regular-sized images today, and feel confident about the safety of the data in their container images.

While we have widely deployed nydus in our production, we believe a proper upgrade to OCI image spec shouldn’t be built without the community. To this end, we propose nydus as a reference implementation that aligns well with the OCI image spec v2 proposal [1], and we look forward to working with other industry leaders should this project come to fruition.

FAQ

Q: What are the challenges with oci image spec v1?

Q: How is this different than crfs?

  • The basic ideas of the two are quite similar. Deep down, the nydus image format supports chunk-level data deduplication and end-to-end data integrity at runtime, which is an improvement over the stargz format used by crfs.

Q: How is this different than Teleport of Azure?

  • Azure Teleport is like the current OCI image format plus an SMB-enabled snapshotter. It supports container image lazy-fetching but suffers from all the tar format defects. On the other hand, nydus deprecates the legacy tar format and takes advantage of the merkle tree format to provide more advantages over the tar format.

Q: What if network is down while container is running with nydus?

  • With OCIv1, a container would fail to start at all if the network went down before the container image was fully downloaded. Nydus changes that picture: because of its lazy fetch/load mechanism, a network failure may instead affect an already running container. Nydus addresses this with a prefetch mechanism, which can be configured to run in the background right after a container starts.

[1]:OCI Image Specification V2 Requirements

In the meantime, the OCI (Open Container Initiative) community has been actively discussing the emergence of OCI image spec v2, aiming to address new challenges with OCI image spec v1.

Starting from June 2020, the OCI community spent more than a month discussing the requirements for OCI image specification v2. It is important to notice that OCIv2 is just a marketing term for updating the OCI specification to better address some use cases. It is not a brand new specification.

The discussion went from an email thread (Proposal Draft for OCI Image Spec V2) and a shared document to several OCI community online meetings, and the result is quite inspiring. The concluded OCIv2 requirements are:

  • Reduced Duplication
  • Canonical Representation (Reproducible Image Building)
  • Explicit (and Minimal) Filesystem Objects and Metadata
  • Mountable Filesystem Format
  • Bill of Materials
  • Lazy Fetch Support
  • Extensibility
  • Verifiability and/or Repairability
  • Reduced Uploading
  • Untrusted Storage

For the detailed meaning of each requirement, please refer to the original shared document. We actively joined the community discussions and found that the nydus project fits these requirements nicely. This further encouraged us to open source the nydus project to help the community discussion with a working code base.

TOC votes to move Dragonfly into CNCF incubator

· 4 min read

This post was migrated by mingcheng from the CNCF Blog; the original post can be found here.

Today, the CNCF Technical Oversight Committee (TOC) voted to accept Dragonfly as an incubation-level hosted project.

Dragonfly, which was accepted into the CNCF Sandbox in October 2018, is an open source, cloud native image and file distribution system. Dragonfly was created in June 2015 by Alibaba Cloud to improve the user experience of image and file distribution in Kubernetes. This allows engineers in enterprises to focus on the application itself rather than infrastructure management.

“Dragonfly is one of the backbone technologies for container platforms within Alibaba’s ecosystem, supporting billions of application deliveries each year, and in use by many enterprise customers around the world,” said, Li Yi, senior staff engineer, Alibaba. “Alibaba looks forward to continually improving Dragonfly, making it more efficient and easier to use.”

The goal of Dragonfly is to tackle distribution problems in cloud native scenarios. The project comprises three main components: the supernode plays the role of central scheduler and controls the entire distribution procedure among the peer network; dfget resides on each peer as an agent that downloads file pieces; and dfdaemon plays the role of a proxy that intercepts image download requests from the container engine and hands them to dfget.

“Dragonfly improves the user experience by taking advantage of a P2P image and file distribution protocol and easing the network load of the image registry,” said Sheng Liang, TOC member and project sponsor. “As organizations across the world migrate their workloads onto container stacks, we expect the adoption of Dragonfly to continue to increase significantly.”

Dragonfly integrates with other CNCF projects, including Prometheus, containerd, Harbor, Kubernetes, and Helm. Project maintainers come from Alibaba, ByteDance, eBay, and Meitu, and there are more than 20 contributing companies, including NetEase, JD.com, Walmart, VMware, Shopee, ChinaMobile, Qunar, ZTE, Qiniu, NVIDIA, and others.

Main Dragonfly Features:

  • P2P based file distribution: Using P2P technology for file transmission, which can make full use of the bandwidth resources of each peer to improve download efficiency, saves a lot of cross-IDC bandwidth, especially costly cross-board bandwidth.
  • Non-invasive support for all kinds of container technologies: Dragonfly can seamlessly support various containers for distributing images.
  • Host-level speed limit: Many download tools (wget/curl) only rate-limit the current download task, but Dragonfly also provides a rate limit for the entire host (see the sketch after this list).
  • Passive CDN: The CDN mechanism can avoid repetitive remote downloads.
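
As a sketch of the host-level rate limit mentioned above (as opposed to per-task limits), the snippet below shares one token-bucket limiter across several concurrent download tasks, using the golang.org/x/time/rate package. The 10 MB/s limit and the chunk sizes are arbitrary illustrative values, not Dragonfly defaults.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// One limiter shared by every download task on the host, which is the
	// difference from the per-task limits of tools like wget/curl.
	hostLimit := rate.NewLimiter(rate.Limit(10<<20), 1<<20) // ~10 MB/s, 1 MiB burst

	var wg sync.WaitGroup
	start := time.Now()
	for task := 0; task < 3; task++ {
		wg.Add(1)
		go func(task int) {
			defer wg.Done()
			for i := 0; i < 5; i++ {
				// Each task "downloads" a 1 MiB chunk; the shared limiter keeps
				// the host's total throughput under the configured rate.
				_ = hostLimit.WaitN(context.Background(), 1<<20)
			}
			fmt.Printf("task %d done at %v\n", task, time.Since(start).Round(100*time.Millisecond))
		}(task)
	}
	wg.Wait()
}
```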

Notable Milestones:

  • 7 project maintainers from 4 organizations
  • 67 contributors
  • 21 contributing organizations
  • 4.6k + GitHub stars
  • 100k + downloads in Docker Hub
  • 120% increase in commits last year

Since it joined the CNCF sandbox, Dragonfly has grown rapidly across industries including e-commerce, telecom, financial, internet, and more. Users include organizations like Alibaba, China Mobile, Ant Financial, Huya, Didi, iFLYTEK, and others.

“As cloud native adoption continues to grow, the distribution of container images in large scale production environments becomes an important challenge to tackle, and we are glad that Dragonfly shares some of those initial lessons learned at Alibaba,” said Chris Aniszczyk, CTO/COO of CNCF. “The Dragonfly project has made a lot of strides recently as it was completely rewritten in Golang for performance improvements, and we look forward to cultivating and diversifying the project community.”

In its latest version, Dragonfly 1.0.0, the project has been completely rewritten in Golang to improve ease of use with other cloud native technologies. Now Dragonfly brings a more flexible and scalable architecture, more cloud scenarios, and a potential integration with OCI (Open Container Initiative) to make image distribution more efficient.

“We are very excited for Dragonfly to move into incubation,” said Allen Sun, staff engineer at Alibaba and Dragonfly project maintainer. “The maintainers have been working diligently to improve on all aspects of the project, and we look forward to seeing what this next chapter will bring.”

As a CNCF hosted project, joining incubating technologies like OpenTracing, gRPC, CNI, Notary, NATS, Linkerd, Helm, Rook, Harbor, etcd, OPA, CRI-O, TiKV, CloudEvents, Falco, and Argo, Dragonfly is part of a neutral foundation aligned with its technical interests, as well as the larger Linux Foundation, which provides governance, marketing support, and community outreach.

Every CNCF project has an associated maturity level: sandbox, incubating, or graduated. For more information on maturity requirements for each level, please visit the CNCF Graduation Criteria v.1.3.

To learn more about Dragonfly, visit https://github.com/dragonflyoss/Dragonfly.