Skip to main content

Hugging Face accelerates distribution of models and datasets based on Dragonfly

· 10 min read

CNCF projects highlighted in this post, and migrated by mingcheng.

This document will help you experience how to use dragonfly with hugging face. During the downloading of datasets or models, the file size is large and there are many services downloading the files at the same time. The bandwidth of the storage will reach the limit and the download will be slow.

Diagram flow showing Hugging Face Hub flow from Cluster A and Cluster B

Dragonfly can be used to eliminate the bandwidth limit of the storage through P2P technology, thereby accelerating file downloading.

Diagram flow showing Hugging Face Hub flow from Cluster A and Cluster B

Dragonfly completes security audit!

· 3 min read

This summer, over four engineer weeks, Trail of Bits and OSTIF collaborated on a security audit of dragonfly. A CNCF Incubating Project, dragonfly functions as file distribution for peer-to-peer technologies. Included in the scope was the sub-project Nydus’s repository that works in image distribution. The engagement was outlined and framed around several goals relevant to the security and longevity of the project as it moves towards graduation.

The Trail of Bits audit team approached the audit by using static and manual testing with automated and manual processes. By introducing semgrep and CodeQL tooling, performing a manual review of client, scheduler, and manager code, and fuzz testing on the gRPC handlers, the audit team was able to identify a variety of findings for the project to improve their security. In focusing efforts on high-level business logic and externally accessible endpoints, the Trail of Bits audit team was able to direct their focus during the audit and provide guidance and recommendations for dragonfly’s future work.

Recorded in the audit report are 19 findings. Five of the findings were ranked as high, one as medium, four low, five informational, and four were considered undetermined. Nine of the findings were categorized as Data Validation, three of which were high severity. Ranked and reviewed as well was dragonfly’s Codebase Maturity, comprising eleven aspects of project code which are analyzed individually in the report.

This is a large project and could not be reviewed in total due to time constraints and scope. multiple specialized features were outside the scope of this audit for those reasons. this project is a great opportunity for continued audit work to improve and elevate code and harden security before graduation. Ongoing efforts for security is critical, as security is a moving target.

Using dragonfly to distribute images and files for multi-cluster kubernetes

· 14 min read

Posted on September 1, 2023

CNCF projects highlighted in this post, and migrated by mingcheng.

Dragonfly provides efficient, stable, securefile distribution and image acceleration based on p2p technology to be the best practice and standard solution in cloud native architectures. It is hosted by the Cloud Native Computing Foundation(CNCF) as an Incubating Level Project.

This article introduces the deployment of dragonfly for multi-cluster kubernetes. A dragonfly cluster manages cluster within a network. If you have two clusters with disconnected networks, you can use two dragonfly clusters to manage their own clusters.

The recommended deployment for multi-cluster kubernetes is to use a dragonfly cluster to manage a kubernetes cluster, and use a centralized manager service to manage multiple dragonfly clusters. Because peer can only transmit data in its own dragonfly cluster, if a kubernetes cluster deploys a dragonfly cluster, then a kubernetes cluster forms a p2p network, and internal peers can only schedule and transmit data in a kubernetes cluster.

Screenshot showing diagram flow between Network A / Kubernetes Cluster A and Network B / Kubernetes Cluster B towards Manager

Ant Group security technology’s Nydus and Dragonfly image acceleration practices

· 23 min read

CNCF projects highlighted in this post, and migrated by mingcheng.

Introduction

ZOLOZ is a global security and risk management platform under Ant Group. Through biometric, big data analysis, and artificial intelligence technologies, ZOLOZ provides safe and convenient security and risk management solutions for users and institutions. ZOLOZ has provided security and risk management technology support for more than 70 partners in 14 countries and regions, including China, Indonesia, Malaysia, and the Philippines. It has already covered areas such as finance, insurance, securities, credit, telecommunications, and public services, and has served over 1.2 billion users.

With the explosion of Kubernetes and cloud-native, ZOLOZ applications have begun to be deployed on a large scale on public clouds using containerization. The images of ZOLOZ applications have been maintained and updated for a long time, and both the number of layers and the overall size have reached a large scale (hundreds of MBs or several GBs). In particular, the basic image size of ZOLOZ’s AI algorithm inference application is much larger than that of general application images (PyTorch/PyTorch:1.13.1-CUDA 11.6-cuDNN 8-Runtime on Docker Hub is 4.92GB, compared to CentOS:latest with only about 234MB).

For container cold start, i.e., when there is no image locally, the image needs to be downloaded from the registry before creating the container. In the production environment, container cold start often takes several minutes, and as the scale increases, the registry may be unable to download images quickly due to network congestion within the cluster. Such large images have brought many challenges to application updates and scaling. With the continuous promotion of containerization on public clouds, ZOLOZ applications mainly face three challenges:

  1. The algorithm image is large, and pushing it to the cloud image repository takes a long time. During the development process, when testing in the testing environment, developers often hope to iterate quickly and verify quickly. However, every time a branch is modified and released for verification, it takes several tens of minutes, which is very inefficient.
  2. Pulling the algorithm image takes a long time, and pulling many image files during cluster expansion can easily cause the cluster network card to be flooded and affect the normal operation of the business.
  3. The cluster machine takes a long time to start up, making it difficult to meet the needs of sudden traffic increases and elastic automatic scaling.

Although various compromise solutions have been attempted, these solutions all have their shortcomings. Now, in collaboration with multiple technical teams such as Ant Group, Alibaba Cloud, and ByteDance, a more universal solution on public clouds has been developed, which has low transformation costs and good performance, and currently appears to be an ideal solution.

Volcano Engine, distributed image acceleration practice based on Dragonfly

· 11 min read

CNCF projects highlighted in this post, and migrated by mingcheng.

Terms and definitions

TermDefinition
OCIThe Open Container Initiative is a Linux Foundation project launched by Docker in June 2015 to design open standards for operating system-level virtualization (and most importantly Linux containers).
OCI ArtifactProducts that follow the OCI image spec.
ImageThe image in this article refers to OCI Artifact
Image DistributionA product distribution implemented according to the OCI distribution spec.
ECSIt is a collection of resources composed of CPU, memory, and Cloud Drive, each of which logically corresponds to the computing hardware entity of the Data center infrastructure.
CRVolcano Engine image distribution service.
VKEVolcano Engine deeply integrates the new generation of Cloud Native technology to provide high-performance Kubernetes container cluster management services with containers as the core, helping users to quickly build containerized applications.
VCIVolcano is a serverless and containerized computing service. The current VCI seamlessly integrates with the Container Service VKE to provide Kubernetes orchestration capabilities. With VCI, you can focus on building the app itself, without having to buy and manage infrastructure such as the underlying Cloud as a Service, and pay only for the resources that the container actually consumes to run. VCI also supports second startup, high concurrent creation, sandbox Container Security isolation, and more.
TOSVolcano Engine provides massive, secure, low-cost, easy-to-use, highly reliable and highly available distributed cloud storage services.
Private ZonePrivate DNS service based on a proprietary network VPC (Virtual Private Cloud) environment. This service allows private domain names to be mapped to IP addresses in one or more custom VPCs.
P2PPeer-to-peer technology, when a peer in a P2P network downloads data from the server, it can also be used as a server level for other peers to download after downloading the data. When a large number of nodes download at the same time, it can ensure that the subsequent downloaded data does not need to be downloaded from the server side. Thereby reducing the pressure on the server side.
DragonflyDragonfly is a file distribution and image acceleration system based on P2P technology, and is the standard solution and best practice in the field of image acceleration in Cloud Native architecture. Now hosted as an incubation project by the Cloud Native Computing Foundation (CNCF).
NydusNydus Acceleration Framework implements a content-addressable filesystem that can accelerate container image startup by lazy loading. It has supported the creation of millions of accelerated image containers daily, and deeply integrated with the linux kernel's erofs and fscache, enabling in-kernel support for image acceleration.

Background

Volcano Engine image repository CR uses TOS to store container images. Currently, it can meet the demand of large-scale concurrent image pulling to a certain extent. However, the final concurrency of pulling is limited by the bandwidth and QPS of TOS.

Here is a brief introduction of the two scenarios that are currently encountered for large-scale image pulling:

  1. The number of clients is increasing, and the images are getting larger. The bandwidth of TOS will eventually be insufficient.
  2. If the client uses Nydus to convert the image format, the request volume to TOS will increase by an order of magnitude. The QPS limit of TOS API makes it unable to meet the demand.

Whether it is the image repository service itself or the underlying storage, there will be bandwidth and QPS limitations in the end. If you rely solely on the bandwidth and QPS provided by the server, it is easy to be unable to meet the demand. Therefore, P2P needs to be introduced to reduce server pressure and meet the demand for large-scale concurrent image pulling.

Dragonfly integrates nydus for image acceleration practive

· 11 min read
Gaius
Dragonfly Maintainer

Introduce definition

Dragonfly has been selected and put into production use by many Internet companies since its open source in 2017, and entered CNCF in October 2018, becoming the third project in China to enter the CNCF Sandbox. In April 2020, CNCF TOC voted to accept Dragonfly as an CNCF Incubating project. Dragonfly has developed the next version through production practice, which has absorbed the advantages of Dragonfly1.x and made a lot of optimizations for known problems.

Nydus optimized the OCIv1 image format, and designed a brand new image-based filesystem, so that the container can download the image on demand, and the container no longer needs to download the complete image to start the container. In the latest version, dragonfly has completed the integration with the nydus, allowing the container to start downloading images on demand, reducing the amount of downloads. The dragonfly P2P transmission method can also be used during the transmission process to reduce the back-to-source traffic and increase the speed.

The Evolution of the Nydus Image Acceleration

· 14 min read
Jingbo Xu

The Evolution of the Nydus Image Acceleration

Optimized container images together with technologies such as P2P networks can effectively speed up the process of container deployment and startup. In order to achieve this, we developed the Nydus image acceleration service (also a sub-project of CNCF Dragonfly).

In addition to startup speed, core features such as image layering, lazy pulling etc. are also particularly important in the field of container images. But since there is no native filesystem supporting that, most opt for the userspace solution, and Nydus initially did the same. However user-mode solutions are encountering more and more challenges nowadays, such as a huge gap in performance compared with native filesystems, and noticable resource overhead in high-density employed scenarios.

Therefore, we designed and implemented the RAFS v6 format which is compatible with the in-kernel EROFS filesystem, hoping to form a content-addressable in-kernel filesystem for container images. After the lazy-pulling technology of "EROFS over Fscache" was merged into 5.19 kernel, the next-generation architecture of Nydus is gradually becoming clear. This is the first native in-kernel solution for container images, promoting a high-density, high-performance and high-availability solution for container images.

This article will introduce the evolution of Nydus from three perspectives: Nydus architecture outline, RAFS v6 image format and "EROFS over Fscache" on-demand loading technology.

Please refer to Nydus for more details of this project. Now you can experience all these new features with this user guide.

Containerd Accepted Nydus-snapshotter

· 4 min read
Changwei Ge

Containerd Accepted Nydus-snapshotter

Early January, Containerd community has taken in nydus-snapshotter as a sub-project. Check out the code, particular introductions and tutorial from its new repository. We believe that the donation to containerd will attract more users and developers for nydus itself and bring much value to the community users.