Academic Research

From CRIU
[VLDB '23] Async-fork: Mitigating Query Latency Spikes Incurred by the Fork-based Snapshot Mechanism from the OS Level

Abstract

In-memory key-value stores (IMKVSes) serve many online applications. They generally adopt the fork-based snapshot mechanism to support data backup. However, this method can result in query latency spikes because the engine is out of service for queries during the snapshot. In contrast to existing research that optimizes snapshot algorithms, we address the problem at the operating system (OS) level, while keeping the data persistence mechanism in IMKVSes unchanged. Specifically, we first study the impact of the fork operation on query latency. Based on the findings of this study, we propose Async-fork, which performs the fork operation asynchronously to reduce the out-of-service time of the engine. Async-fork is implemented in the Linux kernel and deployed into the online Redis database in public clouds. Our experimental results show that Async-fork can significantly reduce the tail latency of queries during the snapshot.

[EuroSys '21] On-demand-fork: a microsecond fork for memory-intensive and latency-sensitive applications

Abstract

Fork has long been the process creation system call for Unix. At its inception, fork was hailed as an efficient system call due to its use of copy-on-write on memory shared between parent and child processes. However, application memory demand has increased drastically since the early days, and the cost incurred by fork to simply set up virtual memory (e.g., copy page tables) is now a concern, even for applications that only require hundreds of MBs of memory. In practice, fork performance already holds back system efficiency and latency across a range of use cases that fork large processes, such as fault-tolerant systems, serverless frameworks, and testing frameworks.

This paper proposes On-demand-fork, a fast implementation of the fork system call specifically designed for applications with large memory footprints. On-demand-fork relies on the observation that copy-on-write can be generalized to page tables, even on commodity hardware. On-demand-fork executes faster than the traditional fork implementation by additionally sharing page tables between parent and child at fork time and selectively copying page tables in small chunks, on demand, when handling page faults. On-demand-fork is a drop-in replacement for fork that requires no changes to applications or hardware.

We evaluated On-demand-fork on a range of micro-benchmarks and real-world workloads. On-demand-fork significantly reduces the fork invocation time and has improved scalability. For processes with 1 GB of allocated memory, On-demand-fork has a 65× performance advantage over fork. We also evaluated On-demand-fork on testing, fuzzing, and snapshotting workloads of well-known applications, obtaining execution throughput improvements between 59% and 226% and up to 99% invocation latency reduction.

Latest revision as of 02:55, 18 December 2024

  • GPU CRIU

[SoCC '24] On-demand and Parallel Checkpoint/Restore for GPU Applications

[EuroSys '24] Just-In-Time Checkpointing: Low Cost Error Recovery from Deep Learning Training Failures

[arXiv '23] PARALLELGPUOS: A Concurrent OS-level GPU Checkpoint and Restore System using Validated Speculation

[SC-W '23] Checkpoint/Restart for CUDA Kernels

[arXiv:2202.07848 '22] Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads

[Wiley '21] Cricket: A virtualization layer for distributed execution of CUDA applications with checkpoint/restart support

[EuroSys '20] Balancing efficiency and fairness in heterogeneous GPU clusters for deep learning

[HPEC '20] Using Container Migration for HPC Workloads Resilience




  • CRIU for Migration

[APNet '24] Software-based Live Migration for Containerized RDMA

[SEATED '24] Live Migration of Multi-Container Kubernetes Pods in Multi-Cluster Serverless Edge Systems

[ICT '24] Packet Buffering to Minimize Service Downtime and Packet Loss During Redundancy Switchover

[SIGMOD/PODS '24] Demonstration of ElasticNotebook: Migrating Live Computational Notebook States

[ICDCS '24] Dapper: A Lightweight and Extensible Framework for Live Program State Rewriting

[Cloud '24] FastMig: Leveraging FastFreeze to Establish Robust Service Liquidity in Cloud 2.0

[CCGRID '24] Workload-Aware Live Migratable Cloud Instance Detector

[VLDB '23] ElasticNotebook: Enabling Live Migration for Computational Notebooks

[SRDS '23] Transparent Fault Tolerance for Stateful Applications in Kubernetes with Checkpoint/Restore

[ICFEC '23] Migration of Isolated Application Across Heterogeneous Edge Systems

[TNSM '23] Design, Modeling, and Implementation of Robust Migration of Stateful Edge Microservices

[WORDS '23] Evicting for the greater good: The case for Reactive Checkpointing in serverless computing

[Cloud Summit '23] Microservice Debugging with Checkpoint-Restart

[ICC '23] Processing-Aware Migration Model for Stateful Edge Microservices

[DRONES '23] A Dynamic Checkpoint Interval Decision Algorithm for Live Migration-Based Drone-Recovery System

[arXiv:2301.05861 '23] Async-fork: Mitigating Query Latency Spikes Incurred by the Fork-based Snapshot Mechanism from the OS Level

[TOCS '22] H-Container: Enabling Heterogeneous-ISA Container Migration in Edge Computing

[VEE '22] Portkey: hypervisor-assisted container migration in nested cloud environments

[ICPADS '22] A Container Pre-copy Migration Method Based on Dirty Page Prediction and Compression

[NetSoft '22] Demonstration of Containerized Central Unit Live Migration in 5G Radio Access Network

[ATC '22] RRC: Responsive Replicated Containers

[HAL '22] Good Shepherds Care For Their Cattle: Seamless Pod Migration in Geo-Distributed Kubernetes

[ATC '21] MigrOS: Transparent Live-Migration Support for Containerised RDMA Applications

[WoWMoM '21] Extending the QUIC Protocol to Support Live Container Migration at the Edge

[MobileCloud '20] Docker Container Deployment in Distributed Fog Infrastructures with Checkpoint/Restart


  • CRIU Acceleration

[EuroSys '24] Pronghorn: Effective Checkpoint Orchestration for Serverless Hot-Starts

[FGCS '24] Prebaking runtime environments to improve the FaaS cold start latency

[Middleware '23] DynaCut: A Framework for Dynamic and Adaptive Program Customization

[Virginia Tech '23] CRIU-RTX: Remote Thread eXecution using Checkpoint/Restore in Userspace

[Virginia Tech '23] HetMigrate: Secure and Efficient Cross-architecture Process Live Migration

[OSDI '23] No Provisioned Concurrency: Fast RDMA-codesigned Remote Fork for Serverless Computing

[SC '22] Out of hypervisor (OoH): efficient dirty page tracking in userspace using hardware virtualization features

[JNCA '22] iContainer: Consecutive checkpointing with rapid resilience for immortal container-based services

[VLSI '21] Standard-compliant parallel SystemC simulation of loosely-timed transaction level models: From baremetal to Linux-based applications support

[Middleware '20] Prebaking Functions to Warm the Serverless Cold Start

[MEMSYS '19] Fast in-memory CRIU for docker containers

[MCHPC '19] Optimizing Post-Copy Live Migration with System-Level Checkpoint Using Fabric-Attached Memory



  • CRIU Security

[APSys '24] Towards Efficient End-to-End Encryption for Container Checkpointing Systems

[eBPF '24] Custom Page Fault Handling With eBPF

[ARES '24] Don't, Stop, Drop, Pause: Forensics of CONtainer CheckPOINTs (ConPoint)

[ATC '22] RRC: Responsive Replicated Containers

[NDSS '22] FitM: Binary-Only Coverage-Guided Fuzzing for Stateful Network Protocols

[SYSTEX '22] Transparent, Cross-ISA Enclave Offloading

[IPDPS '20] Fault-Tolerant Containers Using NiLiCon


  • CRIU for Database

[Journal of Cloud Computing '24] MDB-KCP: persistence framework of in-memory database with CRIU-based container checkpoint in Kubernetes

[VLDB '23] Async-fork: Mitigating Query Latency Spikes Incurred by the Fork-based Snapshot Mechanism from the OS Level

[VLDB '23] ElasticNotebook: Enabling Live Migration for Computational Notebooks

[arXiv:2301.05861 '23] Async-fork: Mitigating Query Latency Spikes Incurred by the Fork-based Snapshot Mechanism from the OS Level

[EuroSys '21] On-demand-fork: a microsecond fork for memory-intensive and latency-sensitive applications