# Volume Management Guide
This guide explains how to create, manage, and use persistent volumes in k8s-cli for storing data across task executions.
## Overview
Volumes in k8s-cli are Kubernetes PersistentVolumeClaims (PVCs) that provide persistent storage for your tasks. They allow you to:
- Store training data, models, and checkpoints
- Share data between multiple tasks
- Persist data beyond task completion
- Use different storage classes and access modes
## Quick Start

### 1. Create a Volume
Create a 10Gi volume for training data:
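Using the `k8s-cli volumes create <name> <size>` syntax shown in the examples later in this guide:

```bash
k8s-cli volumes create training-data 10Gi
```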
The command returns a unique 8-character volume ID.
### 2. List Volumes
View all your volumes:
With details:
List all users' volumes (admin):
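The listing commands below use the `--all-users` flag mentioned in the Troubleshooting section; the flag for detailed output varies by version, so the `--verbose` form is an assumption:

```bash
# List your volumes
k8s-cli volumes list

# With details (flag name is an assumption)
k8s-cli volumes list --verbose

# List all users' volumes (admin)
k8s-cli volumes list --all-users
```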
### 3. Use Volume in Task
Mount the volume in your task definition:
```yaml
name: training-job
resources:
  cpus: "4"
  memory: "8Gi"
volumes:
  /data: training-data  # Mount volume at /data
run: |
  python train.py --data /data
```
### 4. Delete a Volume
Delete a volume (with confirmation):
Force delete without confirmation:
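A sketch of the delete commands; the volume-ID argument follows the `DELETE /volumes/{volume_id}` endpoint in the API Reference, and the `--force` flag is an assumption based on common CLI conventions:

```bash
# Delete with confirmation prompt
k8s-cli volumes delete <volume-id>

# Skip the confirmation prompt (flag name assumed)
k8s-cli volumes delete <volume-id> --force
```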
## Volume Creation Options

### Basic Creation
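The basic form, matching the examples later in this guide:

```bash
k8s-cli volumes create <name> <size>

# For example:
k8s-cli volumes create training-data 10Gi
```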
Parameters:
- name: Logical volume name (used to reference the volume in tasks)
- size: Storage size with unit (e.g., 10Gi, 1Ti, 500Mi)
### Storage Class
Specify a custom storage class:
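Using the `--storage-class` flag shown in Example 4 below:

```bash
k8s-cli volumes create fast-data 20Gi --storage-class fast-ssd
```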
Storage classes are defined by your Kubernetes cluster and may include:
- standard (default)
- fast-ssd
- slow-hdd
- Custom classes defined by your cluster admin
### Access Modes
Specify how the volume can be mounted:
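Using the `--access-modes` flag shown in Example 2 below:

```bash
k8s-cli volumes create shared-data 50Gi --access-modes ReadWriteMany
```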
Available Access Modes:
| Access Mode | Description | Use Case |
|---|---|---|
| ReadWriteOnce | Can be mounted as read-write by a single node | Default, suitable for most tasks |
| ReadWriteMany | Can be mounted as read-write by multiple nodes | Multi-node tasks sharing data |
| ReadOnlyMany | Can be mounted as read-only by multiple nodes | Shared read-only datasets |
Multiple Access Modes:
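Multiple modes can be passed to `--access-modes`; the comma separator below is an assumption and may differ in your k8s-cli version:

```bash
k8s-cli volumes create flexible-data 50Gi --access-modes ReadWriteOnce,ReadOnlyMany
```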
## Volume Status
Volumes can have the following status values (from Kubernetes PVC phases):
- `Pending`: Volume is being created or waiting for storage
- `Bound`: Volume is ready and bound to underlying storage
- `Lost`: Volume lost connection to underlying storage (rare)
Only `Bound` volumes can be successfully mounted in tasks.
## Using Volumes in Tasks

### Basic Volume Mount
Example:
```yaml
name: data-processing
resources:
  cpus: "2"
  memory: "4Gi"
volumes:
  /data: training-data
run: |
  ls -la /data
  python process.py --input /data/raw --output /data/processed
```
### Multiple Volume Mounts
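A task can mount several volumes by listing one mount point per line under `volumes`; a minimal sketch following the format used in Example 1 below:

```yaml
name: multi-volume-task
volumes:
  /data: training-data
  /checkpoints: checkpoints
run: |
  python train.py --data /data --checkpoints /checkpoints
```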
### Volume Resolution
When you specify a volume by name in a task, k8s-cli automatically resolves it:
1. Searches for a PVC with label `volume-name={name}` owned by your user
2. If found, uses that PVC
3. If not found, treats the name as a literal PVC name

This allows you to:
- Use volumes created via k8s-cli by their logical name
- Reference existing PVCs directly if needed
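The label lookup in step 1 can be reproduced manually with kubectl, using the label names from the Labels section of this guide:

```bash
kubectl get pvc -l volume-name=training-data,username=<your-username>
```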
## Examples

### Example 1: Training with Persistent Storage
Create volume and run training task:
```bash
# Create volume for training data
k8s-cli volumes create ml-data 100Gi

# Create volume for model checkpoints
k8s-cli volumes create checkpoints 50Gi

# Submit training task (quote the heredoc delimiter so $DATA_DIR and
# $CHECKPOINT_DIR are written literally instead of expanded by the shell)
cat > train.yaml <<'EOF'
name: model-training
resources:
  cpus: "8"
  memory: "32Gi"
  accelerators: "2"
  image_id: "pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime"
volumes:
  /data: ml-data
  /checkpoints: checkpoints
envs:
  DATA_DIR: "/data"
  CHECKPOINT_DIR: "/checkpoints"
setup: |
  pip install transformers datasets tensorboard
run: |
  python train.py \
    --data_dir $DATA_DIR \
    --checkpoint_dir $CHECKPOINT_DIR \
    --epochs 100
EOF

k8s-cli jobs submit train.yaml
```
### Example 2: Multi-Node Task with Shared Volume
For multi-node tasks, use ReadWriteMany access mode:
```bash
# Create shared volume
k8s-cli volumes create shared-workspace 200Gi --access-modes ReadWriteMany

# Submit multi-node task (quote the heredoc delimiter so $NODE_RANK is
# written literally instead of expanded by the shell)
cat > distributed.yaml <<'EOF'
name: distributed-training
num_nodes: 4
resources:
  cpus: "8"
  memory: "16Gi"
  accelerators: "2"
volumes:
  /workspace: shared-workspace
run: |
  # All nodes can read/write to /workspace
  echo "Node $NODE_RANK initialized" > /workspace/node-$NODE_RANK.txt
  python distributed_train.py --workspace /workspace
EOF

k8s-cli jobs submit distributed.yaml
```
### Example 3: Data Preprocessing Pipeline
```bash
# Create volumes (model-checkpoints is used in step 3)
k8s-cli volumes create raw-data 50Gi
k8s-cli volumes create processed-data 100Gi
k8s-cli volumes create model-checkpoints 20Gi

# Step 1: Download data
cat > download.yaml <<EOF
name: download-data
volumes:
  /data: raw-data
run: |
  wget -P /data https://example.com/dataset.tar.gz
  tar -xzf /data/dataset.tar.gz -C /data
EOF
k8s-cli jobs submit download.yaml

# Step 2: Process data
cat > process.yaml <<EOF
name: process-data
volumes:
  /input: raw-data
  /output: processed-data
run: |
  python preprocess.py --input /input --output /output
EOF
k8s-cli jobs submit process.yaml

# Step 3: Train model
cat > train.yaml <<EOF
name: train-model
volumes:
  /data: processed-data
  /checkpoints: model-checkpoints
run: |
  python train.py --data /data --checkpoints /checkpoints
EOF
k8s-cli jobs submit train.yaml
```
### Example 4: Using Fast Storage
For IO-intensive workloads, use fast storage:
```bash
k8s-cli volumes create fast-cache 20Gi --storage-class fast-ssd

cat > io-intensive.yaml <<EOF
name: io-task
resources:
  cpus: "16"
  memory: "64Gi"
volumes:
  /cache: fast-cache
run: |
  # Fast storage for temporary data
  ./io_intensive_workload --cache /cache
EOF

k8s-cli jobs submit io-intensive.yaml
```
## Volume Lifecycle

1. Creation: Volume is created with status `Pending`
2. Binding: Kubernetes binds the PVC to underlying storage (status becomes `Bound`)
3. Usage: Volume can be mounted in tasks
4. Deletion: Volume and its data are permanently deleted

Important Notes:
- Volumes persist after tasks complete (data is retained)
- Deleting a volume permanently destroys its data
- Only the volume owner can delete it
- Volumes must be unmounted (tasks stopped) before deletion
## Storage Best Practices

### 1. Size Planning
- Allocate sufficient space for your data plus ~20% buffer
- Consider data growth over time
- Monitor volume usage with `kubectl get pvc`
### 2. Access Modes
- Use `ReadWriteOnce` for single-task workloads (more efficient)
- Use `ReadWriteMany` only when multiple nodes need concurrent write access
- Use `ReadOnlyMany` for shared datasets that don't change
### 3. Storage Classes
- Use default storage for general purposes
- Use fast-ssd for databases, checkpoints, or IO-intensive tasks
- Use slow-hdd for archival or infrequently accessed data
- Check available storage classes: `kubectl get storageclass`
### 4. Volume Organization

- Create separate volumes for different data types:
  - `/data` for input datasets
  - `/checkpoints` for model checkpoints
  - `/logs` for logs and metrics
  - `/output` for results
- Use meaningful volume names (e.g., `imagenet-train`, `bert-checkpoints`)
### 5. Data Management
- Back up important data outside the cluster
- Clean up unused volumes regularly
- Use smaller volumes for experiments, larger for production
## Troubleshooting

### Volume Stuck in Pending
Problem: Volume status remains Pending after creation.
Solutions:
- Check if the cluster has available storage: `kubectl get pv`
- Verify the storage class exists: `kubectl get storageclass`
- Check PVC events: `kubectl describe pvc <pvc-name>`
- Ensure the cluster has a default storage provisioner
### Cannot Mount Volume in Task
Problem: Task fails to start due to volume mount issues.
Solutions:
- Verify the volume status is `Bound`: `k8s-cli volumes list`
- Check the volume name spelling in the task definition
- Ensure you're the volume owner (or use `--all-users` to check)
- For multi-node tasks, verify the access mode is `ReadWriteMany`
### Volume Deletion Fails
Problem: Cannot delete a volume.
Solutions:
- Stop all tasks using the volume: `k8s-cli jobs list`, then `k8s-cli jobs stop <task-id>`
- Wait for tasks to fully terminate
- Check PVC finalizers: `kubectl get pvc <pvc-name> -o yaml`
- Verify you're the volume owner
### Out of Disk Space
Problem: Task fails because volume is full.
Solutions:
- Create a larger volume and migrate data
- Clean up unnecessary files in the existing volume
- Use compression for data storage
- Consider using object storage for large datasets
### Wrong Storage Class
Problem: Volume created with incorrect storage class.
Solutions:
- Delete the volume and recreate it with the correct `--storage-class`
- Or create a new volume and migrate the data
- Check available storage classes: `kubectl get storageclass`
## Advanced Usage

### Checking Volume Details
Get detailed volume information via API:
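For example, using the `GET /volumes/{volume_id}` endpoint listed in the API Reference; the server address and any authentication headers are deployment-specific assumptions:

```bash
# Replace <api-server> with your deployment's address
curl -s http://<api-server>/volumes/<volume-id>
```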
### Listing All Volumes (Admin)
See all volumes across all users:
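Using the `--all-users` flag mentioned in the Troubleshooting section:

```bash
k8s-cli volumes list --all-users
```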
### Volume Naming Convention

PVCs created by k8s-cli follow the naming pattern `{name}-{volume-id}` (inferred from the example below and the 8-character volume ID).

Example: `training-data-vol12345`
### Labels
All volumes are labeled with:
- `skypilot-volume=true`
- `volume-id={id}`
- `volume-name={name}`
- `username={sanitized-username}`
These labels enable filtering and user isolation.
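For example, the labels can be used directly with kubectl label selectors to find every k8s-cli-managed volume belonging to one user:

```bash
kubectl get pvc -l skypilot-volume=true,username=<sanitized-username>
```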
## API Reference
For programmatic access, use the REST API endpoints:
- `POST /volumes/create` - Create a new volume
- `GET /volumes` - List volumes
- `GET /volumes/{volume_id}` - Get volume status
- `DELETE /volumes/{volume_id}` - Delete a volume
See the SPEC.md for detailed API documentation.
## Next Steps
- Learn about Task Management
- See Example Workflows in the specification
- Explore the example_task.yaml for reference