k8s-cli Documentation
Welcome to k8s-cli, a lightweight SkyPilot-compatible Kubernetes task launcher that enables running computational tasks on Kubernetes clusters using the familiar SkyPilot YAML specification format.
What is k8s-cli?
k8s-cli is a command-line tool and API server that simplifies running computational workloads on Kubernetes. It provides:
- Simple Task Submission: Submit jobs using familiar SkyPilot YAML syntax
- Multi-Node Support: Run distributed workloads across multiple nodes
- Persistent Storage: Create and manage volumes for data persistence
- Resource Management: Configure CPU, memory, and GPU requirements
- User Isolation: Multi-tenant support with user-scoped resources
- Real-Time Monitoring: Track task status and stream logs
Quick Start
Installation
Start the API Server
The server starts on http://localhost:8000 by default.
Submit Your First Task
- Login (set your username):
- Create a task definition (
hello.yaml):
name: hello-world
resources:
cpus: "1"
memory: "1Gi"
image_id: "python:3.11-slim"
run: |
echo "Hello from k8s-cli!"
python -c "print('Task completed successfully')"
- Submit the task:
The task will be submitted, and logs will be tailed automatically. You'll receive a unique task ID (e.g., abc12345).
- Check task status:
Key Features
Tasks
Tasks are computational jobs that run on Kubernetes. Each task can:
- Run on one or more nodes (for distributed workloads)
- Request specific resources (CPU, memory, GPUs)
- Use custom container images
- Mount persistent volumes for data storage
- Define setup and run commands
- Access environment variables
Learn more in the Task Management Guide.
Volumes
Volumes provide persistent storage for your tasks. You can:
- Create PersistentVolumeClaims with custom sizes and storage classes
- Mount volumes in tasks for reading and writing data
- Share data between multiple tasks
- Choose from different access modes (ReadWriteOnce, ReadWriteMany, ReadOnlyMany)
Learn more in the Volume Management Guide.
Multi-Tenancy
k8s-cli supports multiple users working on the same cluster:
- User-scoped resources (tasks and volumes)
- Automatic resource isolation using Kubernetes labels
- Optional admin operations to view all users' resources
Core Concepts
Task Lifecycle
- Submit: Create a task from a YAML definition
- Pending: Kubernetes schedules the task pods
- Running: Task executes on one or more nodes
- Completed/Failed: Task finishes with success or failure
- Optional: Stop running tasks manually
Volume Lifecycle
- Create: Provision a PersistentVolumeClaim
- Bind: Kubernetes binds the PVC to storage
- Mount: Use the volume in one or more tasks
- Delete: Remove the volume and its data
Multi-Node Tasks
For distributed workloads, specify num_nodes in your task definition:
- Each node gets a unique
NODE_RANK(0, 1, 2, ...) - All nodes receive
NUM_NODESenvironment variable - Kubernetes creates one Job per node
- Task is complete only when all nodes succeed
Architecture
k8s-cli consists of four main components:
1. API Server
FastAPI-based REST API that handles task and volume operations: - Task submission, listing, status, logs, and stopping - Volume creation, listing, and deletion - User authentication via headers
2. CLI
Typer-based command-line interface with command groups:
- auth: User login and identity management
- jobs: Task operations (submit, list, status, logs, stop)
- volumes: Volume operations (create, list, delete)
3. Kubernetes Executor
Core engine that manages Kubernetes resources: - Creates and manages Jobs for tasks - Creates and manages PersistentVolumeClaims for volumes - Aggregates status across multi-node tasks - Streams logs from pods
4. Task Models
Pydantic models for data validation: - Task definitions and configurations - API request/response schemas - Status tracking and metadata
Example Workflows
Machine Learning Training
# 1. Create volume for datasets
uv run k8s-cli volumes create training-data 50Gi
# 2. Create volume for model checkpoints
uv run k8s-cli volumes create checkpoints 20Gi
# 3. Submit training task
cat > train.yaml <<EOF
name: model-training
resources:
cpus: "8"
memory: "32Gi"
accelerators: "2"
image_id: "pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime"
volumes:
/data: training-data
/checkpoints: checkpoints
setup: |
pip install transformers datasets
run: |
python train.py --data /data --output /checkpoints
EOF
uv run k8s-cli jobs submit train.yaml
# 4. Monitor progress
uv run k8s-cli jobs list
uv run k8s-cli jobs logs <task-id>
Distributed Computing
# Submit multi-node task
cat > distributed.yaml <<EOF
name: distributed-job
num_nodes: 4
resources:
cpus: "4"
memory: "8Gi"
run: |
echo "Node \$NODE_RANK of \$NUM_NODES"
python distributed_compute.py --rank \$NODE_RANK --world-size \$NUM_NODES
EOF
uv run k8s-cli jobs submit distributed.yaml
Data Processing Pipeline
# Step 1: Download data
uv run k8s-cli volumes create raw-data 100Gi
uv run k8s-cli jobs submit download_data.yaml
# Step 2: Process data
uv run k8s-cli volumes create processed-data 200Gi
uv run k8s-cli jobs submit process_data.yaml
# Step 3: Run analysis
uv run k8s-cli jobs submit analyze.yaml
Documentation Structure
This documentation is organized into the following sections:
- index.md (this page) - Introduction and overview
- tasks.md - Complete guide to task management
- volumes.md - Complete guide to volume management
- ../SPEC.md - Technical specification with API details
Requirements
- Kubernetes cluster with configured access (kubeconfig)
- Python 3.11+
- kr8s library for Kubernetes API access
- Storage provisioner (for dynamic volume creation)
- GPU support (optional, for accelerator workloads)
Configuration
Environment Variables
K8S_CLI_API_URL: API server URL (default:http://localhost:8000)K8S_CLI_API_PORT: Server port for API server (default:8000)
Kubernetes Configuration
- Uses your default kubeconfig context
- All resources created in the
defaultnamespace - Requires appropriate RBAC permissions for creating Jobs and PVCs
Getting Help
CLI Help
# General help
uv run k8s-cli --help
# Command group help
uv run k8s-cli jobs --help
uv run k8s-cli volumes --help
uv run k8s-cli auth --help
# Specific command help
uv run k8s-cli jobs submit --help
API Documentation
Once the API server is running, visit:
- http://localhost:8000/docs - Interactive Swagger UI
- http://localhost:8000/redoc - ReDoc documentation
Health Check
Check if the API server is running:
Common Use Cases
1. Interactive Development
Submit tasks with --detach to continue working:
2. Batch Processing
Submit multiple tasks and monitor:
for file in tasks/*.yaml; do
uv run k8s-cli jobs submit "$file" --detach
done
uv run k8s-cli jobs list
3. Experiment Tracking
Use descriptive task names and volumes:
4. Resource Cleanup
Stop all your tasks:
Delete unused volumes:
Best Practices
- Always specify resources: Define CPU and memory to ensure predictable scheduling
- Use meaningful names: Name tasks and volumes descriptively for easy identification
- Separate volumes by purpose: Use different volumes for data, checkpoints, logs, etc.
- Check task status: Monitor tasks before submitting new work
- Clean up regularly: Delete completed tasks and unused volumes
- Use appropriate images: Choose minimal base images for faster startup
- Test locally first: Validate task definitions with small resource requests
- Back up important data: Volumes are persistent but not backed up automatically
Next Steps
Ready to dive deeper? Check out:
- Task Management Guide - Learn all about submitting and managing tasks
- Volume Management Guide - Master persistent storage
- Technical Specification - Detailed API and architecture reference
- Example Task - See a complete task definition
Troubleshooting
API Server Won't Start
- Check if port 8000 is already in use
- Verify kubeconfig is properly configured
- Ensure all dependencies are installed (
uv sync)
Task Submission Fails
- Verify the API server is running
- Check task YAML syntax (the
runfield is required) - Ensure you're logged in (
uv run k8s-cli auth whoami)
Task Stuck in Pending
- Check cluster resources:
kubectl get nodes - Verify resource requests don't exceed node capacity
- Check for volume binding issues
Volume Mount Failures
- Ensure volume status is
Bound - Verify volume name matches in task definition
- For multi-node tasks, use
ReadWriteManyaccess mode
For more detailed troubleshooting, see the Tasks and Volumes guides.
Contributing
k8s-cli follows these conventions:
- Run
uv syncto set up the development environment - Format code with
ruff format . - Check code with
ruff check . --fix - Update
SPEC.mdwhen adding or removing features
License
See LICENSE for details.