# Task Management Guide
This guide explains how to use the task management features in k8s-cli to submit, monitor, and manage computational tasks on Kubernetes.
## Overview
Tasks are computational jobs that run on Kubernetes using the SkyPilot-compatible YAML specification. Each task can run on one or more nodes with configurable resources like CPU, memory, and GPUs.
## Quick Start

### 1. Submit a Task

Create a task definition file (`task.yaml`):
```yaml
name: my-training-job
num_nodes: 1
resources:
  cpus: "2"
  memory: "4Gi"
  image_id: "python:3.11-slim"
setup: |
  pip install numpy pandas
run: |
  python train.py
```
Submit the task:
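A plausible invocation, assuming a `submit` subcommand under the same `k8s-cli jobs` group this guide uses for `logs`:

```bash
# Subcommand name assumed; only the `k8s-cli jobs` group is confirmed in this guide.
k8s-cli jobs submit task.yaml
```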
The command will:
- Submit the task to the API server
- Return a unique 8-character task ID
- Automatically tail the task logs (unless `--detach` is used)
### 2. List Tasks

View all your tasks, with optional details, or across all users (admin), as sketched below.
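The `list` subcommand and both flags here are assumptions; this guide only confirms the `k8s-cli jobs` command group:

```bash
k8s-cli jobs list               # view all your tasks
k8s-cli jobs list --details     # with details (flag name assumed)
k8s-cli jobs list --all-users   # all users' tasks (admin; flag name assumed)
```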
### 3. Check Task Status

Get detailed status for a specific task:
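A plausible invocation, with the `status` subcommand name assumed:

```bash
# Subcommand name assumed; use the 8-character ID returned at submission.
k8s-cli jobs status <task-id>
```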
### 4. View Logs
Tail logs from a running or completed task:
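Using the `k8s-cli jobs logs` command referenced in the Troubleshooting section:

```bash
k8s-cli jobs logs <task-id>
```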
Press `Ctrl+C` to stop tailing logs.
### 5. Stop a Task

Stop a specific task, all of your own tasks, or all tasks for all users (admin), as sketched below.
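Plausible invocations, assuming a `stop` subcommand that mirrors the `POST /tasks/{task_id}/stop` and `POST /tasks/stop` endpoints listed under API Reference:

```bash
k8s-cli jobs stop <task-id>           # stop a specific task
k8s-cli jobs stop --all               # stop all your tasks (flag name assumed)
k8s-cli jobs stop --all --all-users   # stop all tasks for all users (admin; flag names assumed)
```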
## Task Definition Reference

### Required Fields

- `run` (string): The main command to execute. This is the only required field.

### Optional Fields

#### Basic Configuration

- `name` (string): Task name. Defaults to `task-{id}` if not specified.
- `num_nodes` (integer): Number of nodes to run the task on. Default: 1, minimum: 1.
- `workdir` (string): Working directory for the task.
#### Resources

The `resources` section configures compute resources for each node:

```yaml
resources:
  cpus: "2"                      # CPU allocation (e.g., "2", "500m")
  memory: "4Gi"                  # Memory allocation (e.g., "4Gi", "512Mi")
  accelerators: "1"              # GPU allocation (e.g., "1", "V100:2")
  image_id: "python:3.13-slim"   # Container image (default: "python:3.13-slim")
```
**Resource Behavior:**

- CPU and memory are set as both requests and limits
- GPUs are set as limits only (using `nvidia.com/gpu`)
- Unsupported fields: `instance_type`, `use_spot`, `disk_size`, `ports` (ignored)
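As a sketch of that mapping, the container `resources` stanza produced for `cpus: "2"`, `memory: "4Gi"`, `accelerators: "1"` would look roughly like this:

```yaml
# Illustrative pod-spec fragment, not literal k8s-cli output.
resources:
  requests:
    cpu: "2"
    memory: "4Gi"
  limits:
    cpu: "2"
    memory: "4Gi"
    nvidia.com/gpu: "1"   # GPUs are set as limits only
```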
#### Environment Variables

Define environment variables for your task:
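For example, using the `envs` mapping that also appears in the multi-node example later in this guide (the names and values here are placeholders):

```yaml
envs:
  DATA_PATH: "/data"
  BATCH_SIZE: "64"
```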
**Auto-injected Variables:**

- `NODE_RANK`: Node index (0-based) for multi-node tasks
- `NUM_NODES`: Total number of nodes
#### Volume Mounts

Mount persistent volumes into your task:

```yaml
volumes:
  /data: training-data            # Mount volume named "training-data" at /data
  /checkpoints: model-checkpoints
```

The volume name is automatically resolved to the corresponding PersistentVolumeClaim. See the Volumes Guide for details.
#### Setup and Run Commands

- `setup` (string): Commands to run before the main task (e.g., installing dependencies).
- `run` (string): Main command to execute (required).
## Multi-Node Tasks

For distributed workloads, specify the number of nodes:

```yaml
name: distributed-training
num_nodes: 4
resources:
  cpus: "8"
  memory: "16Gi"
  accelerators: "2"
  image_id: "nvcr.io/nvidia/pytorch:23.10-py3"
envs:
  MASTER_ADDR: "master-service"
  MASTER_PORT: "29500"
run: |
  python -m torch.distributed.launch \
    --nproc_per_node=2 \
    --nnodes=$NUM_NODES \
    --node_rank=$NODE_RANK \
    train_distributed.py
```
**How It Works:**

- Creates one Kubernetes Job per node
- Job names follow the pattern `{task-name}-{task-id}-node-{idx}` (e.g., `distributed-training-a1b2c3d4-node-0`, with a hypothetical 8-character ID)
- Each node gets `NODE_RANK` (0 to `num_nodes - 1`) and `NUM_NODES` environment variables
- Logs from all nodes are prefixed with `node-{idx}`
**Status Aggregation:**
- Task is pending if no nodes are running
- Task is running if at least one node is active
- Task is failed if any node fails
- Task is completed only when all nodes succeed
## Task Lifecycle

- **Submission**: Task is submitted with a unique 8-character ID
- **Pending**: Kubernetes is scheduling pods
- **Running**: At least one pod is executing
- **Completed**: All nodes finished successfully
- **Failed**: One or more nodes failed
## Examples

### Simple Python Task

```yaml
name: data-processing
resources:
  cpus: "4"
  memory: "8Gi"
  image_id: "python:3.11"
setup: |
  pip install pandas numpy
run: |
  python process_data.py
```
### GPU Training Task

```yaml
name: gpu-training
resources:
  cpus: "8"
  memory: "32Gi"
  accelerators: "2"
  image_id: "nvidia/cuda:12.1.0-runtime-ubuntu22.04"
setup: |
  pip install torch torchvision
run: |
  python train_model.py --gpu
```
### Task with Volume Mounts

```yaml
name: training-with-data
resources:
  cpus: "4"
  memory: "16Gi"
  image_id: "python:3.11"
volumes:
  /data: training-data
  /output: model-checkpoints
envs:
  DATA_PATH: "/data"
  OUTPUT_PATH: "/output"
run: |
  python train.py --data $DATA_PATH --output $OUTPUT_PATH
```
### Detached Submission

Submit a task without tailing logs:
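Assuming the same `submit` subcommand sketched in Quick Start; `--detach` is the flag this guide names:

```bash
k8s-cli jobs submit task.yaml --detach   # subcommand name assumed
```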
Later, view logs:
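```bash
k8s-cli jobs logs <task-id>
```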
## Tips and Best Practices

- **Resource Allocation**: Always specify CPU and memory to ensure predictable scheduling
- **Multi-Node Tasks**: Use `$NODE_RANK` and `$NUM_NODES` for distributed coordination
- **Volume Mounts**: Create volumes before submitting tasks that need persistent storage
- **Container Images**: Use specific image tags for reproducibility (e.g., `python:3.11.5` instead of `python:3.11`)
- **Setup Commands**: Use the `setup` field for one-time initialization to keep the `run` command clean
- **Detached Mode**: Use `--detach` for long-running tasks and check logs later
- **Task Naming**: Give meaningful names to tasks for easier identification in listings
## Troubleshooting

### Task Stuck in Pending
- Check if the cluster has sufficient resources
- Verify the requested resources don't exceed node capacity
- Ensure volume claims are bound (if using volumes)
### Task Failed Immediately

- Check logs with `k8s-cli jobs logs <task-id>`
- Verify the container image exists and is accessible
- Check for syntax errors in `setup`/`run` commands
### Cannot Mount Volume

- Verify the volume exists: `k8s-cli volumes list`
- Ensure the volume is bound (status should be "Bound")
- Check the volume name matches exactly in the task definition
### Multi-Node Task Partially Failed

- Check individual node logs for error messages
- Verify network connectivity between nodes
- Ensure all nodes can access shared volumes (use the `ReadWriteMany` access mode)
## API Reference

For programmatic access, use the REST API endpoints:

- `POST /tasks/submit` - Submit a new task
- `GET /tasks` - List tasks
- `GET /tasks/{task_id}` - Get task status
- `GET /tasks/{task_id}/logs` - Stream task logs
- `POST /tasks/{task_id}/stop` - Stop a task
- `POST /tasks/stop` - Stop multiple tasks
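For instance, a submission over HTTP might look like this; the server address and the JSON body shape are assumptions (only the endpoint path comes from the list above):

```bash
# <api-server> is a placeholder; the request-body schema is assumed to mirror the task YAML fields.
curl -X POST http://<api-server>/tasks/submit \
  -H "Content-Type: application/json" \
  -d '{"name": "my-training-job", "resources": {"cpus": "2", "memory": "4Gi"}, "run": "python train.py"}'
```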
See `SPEC.md` for detailed API documentation.