Guide to deploy Ray Cluster on GKE
A step-by-step production guide to deploying Ray clusters for distributed ML and LLM workloads
If you're building large-scale ML systems (distributed training, batch inference, or LLM serving with vLLM), running Ray on GKE gives you a powerful, production-ready stack.
This post walks through:
- Architecture overview
- Setting up GKE (Standard vs Autopilot)
- Deploying Ray on Google Kubernetes Engine (GKE)
- Configuring GPU-enabled Ray clusters
- Exposing the Ray dashboard securely via Ingress
- Managing dependencies with uv
- Submitting distributed jobs from your laptop or CI
- Preparing your setup for production-grade scaling
Architecture Overview
At a high level:
- Google Kubernetes Engine (GKE) → Infrastructure & orchestration
- Ray → Distributed compute engine
- PyTorch → Model training
- vLLM → High-performance LLM serving
How It All Fits Together
```mermaid
flowchart TD
    A[User / CI] -->|Submit Job| B(GKE Ingress)
    B --> C[Ray Head Pod]
    C --> D[Ray Workers]
    D --> E[GPU Nodes]
    C --> F[Ray Dashboard 8265]
    D --> G[PyTorch Training]
    D --> H[vLLM Serving]
```
What Each Layer Does
| Layer | Responsibility |
|---|---|
| GKE | Provisions nodes, autoscaling, networking |
| KubeRay | Manages Ray clusters as CRDs |
| Ray | Schedules distributed jobs |
| vLLM | Fast LLM inference |
| PyTorch | Training & fine-tuning |
Prepare Your Environment
```bash
export PROJECT_ID=<project_id>
export REGION=us-central1
export ZONE=us-central1-a
export CLUSTER_NAME=ray-cluster
export POOL_NAME=gpu-node-pool
export NAMESPACE=llm

gcloud config set project $PROJECT_ID
gcloud config set billing/quota_project $PROJECT_ID
gcloud services enable container.googleapis.com
```
Connect to your cluster after creation:
```bash
gcloud container clusters get-credentials $CLUSTER_NAME --location=$REGION
```
Create a GKE Cluster
You have two options; so far I have only worked with Option B.
Option A - Autopilot (Managed Mode)
Pros:
- Less infrastructure management
- Ray operator can be enabled directly
```bash
gcloud container clusters create-auto $CLUSTER_NAME \
  --location=$REGION \
  --release-channel=rapid \
  --enable-ray-operator
```
Autopilot currently has limitations with `--enable-ray-operator` in some regions. Autopilot example: [Deploy Ray Serve Stable Diffusion on GKE](https://docs.cloud.google.com/kubernetes-engine/docs/add-on/ray-on-gke/tutorials/deploy-ray-serve-stable-diffusion#autopilot). ChatGPT guides: [Deploy RayCluster on GKE](https://chatgpt.com/share/6988d953-7834-800b-a8fd-1387e2bcedc3) · [RayCluster on GKE](https://chatgpt.com/share/6988d9ab-f750-800b-870b-f4b25bf6f281)
Option B - Standard Cluster (More Control)
Recommended for GPU-heavy ML workloads.
[AI/ML orchestration on GKE](https://cloud.google.com/kubernetes-engine/docs/integrations/ai-infra)

```bash
gcloud container clusters create $CLUSTER_NAME \
  --zone=$ZONE \
  --machine-type e2-standard-4 \
  --num-nodes=1 \
  --enable-autoscaling \
  --min-nodes=0 --max-nodes=2
```
Add GPU Node Pool (NVIDIA L4 Example)
```bash
gcloud container node-pools create $POOL_NAME \
  --cluster=$CLUSTER_NAME \
  --zone=$ZONE \
  --accelerator type=nvidia-l4,count=1 \
  --machine-type g2-standard-4 \
  --enable-autoscaling \
  --min-nodes=0 --max-nodes=2
```
Verify GPU and device plugin:
```bash
kubectl get nodes -o=custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
kubectl get pods -n kube-system -l k8s-app=nvidia-gpu-device-plugin
```
Install KubeRay Operator (If Not Using Autopilot)
```bash
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-operator kuberay/kuberay-operator --version 1.5.1
```
Verify:
```bash
kubectl get pods
```
You should see kuberay-operator running.
Note: If you created the cluster with --enable-ray-operator (Autopilot), skip this step—the Ray operator is already installed.
KubeRay kubectl-ray Plugin (Autopilot Only)
For Autopilot clusters with the GKE Ray add-on, you may need the KubeRay kubectl plugin:
```bash
# Check your KubeRay version (from CRD annotations)
kubectl get crd rayclusters.ray.io -o jsonpath='{.metadata.annotations}' ; echo

# Install kubectl-ray (replace v1.4.2 with your version)
curl -LO https://github.com/ray-project/kuberay/releases/download/v1.4.2/kubectl-ray_v1.4.2_linux_amd64.tar.gz
tar -xvf kubectl-ray_v1.4.2_linux_amd64.tar.gz
cp kubectl-ray ~/.local/bin
kubectl ray version
```
Deploy a GPU-Enabled RayCluster
```bash
kubectl apply -f raycluster-gpu.yaml
```
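The manifest itself isn't shown above; a minimal `raycluster-gpu.yaml` sketch might look like the following. The image tag, resource sizes, and replica counts are illustrative assumptions, not values from this guide:

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-gpu
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"     # expose the dashboard beyond localhost
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.53.0-gpu   # match your Ray/Python versions
            resources:
              limits:
                cpu: "2"
                memory: 4Gi
  workerGroupSpecs:
    - groupName: gpu-workers
      replicas: 1
      minReplicas: 0
      maxReplicas: 2
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.53.0-gpu
              resources:
                limits:
                  nvidia.com/gpu: "1"   # lands the pod on the L4 node pool
```

Keeping `minReplicas: 0` pairs with the node pool's `--min-nodes=0` so idle GPU nodes can scale all the way down.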
Check status:
```bash
kubectl get rayclusters
kubectl get pods --selector=ray.io/cluster=raycluster-gpu
```
RayCluster Internal Structure
```mermaid
flowchart TB
    subgraph GKE Cluster
        H[Ray Head Pod]
        W1[Worker Pod 1]
        W2[Worker Pod 2]
    end
    H --> W1
    H --> W2
    W1 --> GPU1[NVIDIA GPU]
    W2 --> GPU2[NVIDIA GPU]
```
Access Ray Head Pod
```bash
export HEAD_POD=$(kubectl get pods \
  --selector=ray.io/node-type=head \
  -o custom-columns=POD:metadata.name --no-headers)

kubectl exec -it $HEAD_POD -- bash
```
Expose Ray Dashboard (Port 8265) via GKE Ingress
GKE supports gce (external) and gce-internal ingress modes. For gce-internal, you must create a Proxy-Only subnet.
```bash
# Get Ray head service name, then update ray-dashboard-ingress.yaml
kubectl get svc

kubectl apply -f ray-dashboard-ingress.yaml
kubectl get ingress
```
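A hedged sketch of `ray-dashboard-ingress.yaml`, assuming KubeRay's default head-service naming (`<cluster-name>-head-svc`; verify with `kubectl get svc` above):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ray-dashboard-ingress
  annotations:
    kubernetes.io/ingress.class: "gce"   # or "gce-internal" (requires a proxy-only subnet)
spec:
  defaultBackend:
    service:
      name: raycluster-gpu-head-svc      # assumption; take the real name from kubectl get svc
      port:
        number: 8265                     # Ray dashboard / Jobs API port
```

Note that the Ray dashboard has no built-in authentication, so prefer `gce-internal` (or an IAP-protected frontend) for anything beyond experimentation.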
After a few minutes, GKE assigns an external IP.
Visit `http://<EXTERNAL_IP>`.
Networking Flow
```mermaid
flowchart LR
    User --> Ingress
    Ingress --> RayHeadService
    RayHeadService --> RayHeadPod
    RayHeadPod --> RayWorkers
```
Dependency Management with uv
Ray supports runtime environments, which let each job declare its own Python dependencies at submission time.
Important: Local Python version must match the Ray image. For example, rayproject/ray:2.53.0-gpu uses Python 3.10.19.
Use via ray.init() (Inside Ray Pods)
When running code directly inside the head/worker pod, use runtime_env with uv:
```bash
uv export --format requirements.txt -o requirements.txt
```

```python
import ray

# runtime_env expects requirements.txt, not pyproject.toml
ray.init(runtime_env={"uv": "./path/requirements.txt"})
```
Pattern A (Best for Iteration): `ray job submit … -- uv run …`
Keep a repo locally (or on CI) with pyproject.toml, uv.lock, and your scripts. Submit from any machine that can reach the Ingress:
```bash
uv lock

ray job submit \
  --address="http://<INGRESS_IP_OR_DNS>:8265" \
  --no-wait \
  --working-dir . \
  -- uv run main.py
```
Ray uploads your working directory and installs dependencies. Use --no-wait for fire-and-forget, then:
```bash
ray job logs <job-id> --address="http://<INGRESS_IP_OR_DNS>:8265"
ray job status <job-id> --address="http://<INGRESS_IP_OR_DNS>:8265"
ray job stop <job-id> --address="http://<INGRESS_IP_OR_DNS>:8265"
```
Tip: Set `export RAY_ADDRESS="http://<INGRESS_IP_OR_DNS>:8265"` (the variable the Ray Jobs CLI reads) to avoid passing --address every time.
Remote working_dir (Avoid Local Upload)
Instead of --working-dir ., use a remote URI so Ray fetches code from GitHub or GCS:
| Source | Example |
|---|---|
| Public GitHub | https://github.com/user/repo/archive/HEAD.zip |
| Private GitHub | https://user:TOKEN@github.com/user/repo/archive/HEAD.zip |
| GCS | gs://bucket/code.zip |
Example with a subdirectory (e.g. src/ in the repo):
```bash
ray job submit \
  --address="http://<INGRESS_IP_OR_DNS>:8265" \
  --working-dir "https://github.com/user/repo/archive/HEAD.zip" \
  -- uv run --directory src main_src.py
```
Pattern B (Best for Production)
Bake dependencies into the image to avoid per-job installs:
```dockerfile
FROM rayproject/ray-ml:2.x-gpu
# COPY with multiple sources needs a directory destination (trailing slash)
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen
```
Use this image in your RayCluster for both head and workers. Then job submission can ship only code or parameters.
Pattern C: Remote Code Only (No Local Upload)
If you want to avoid uploading from your machine entirely:
- Zip your repo (single top-level directory) and upload to GCS.
- Submit with `--runtime-env-json` and `working_dir: "gs://bucket/code.zip"`:
```bash
ray job submit \
  --address="http://<INGRESS_IP_OR_DNS>:8265" \
  --runtime-env-json='{"working_dir": "gs://bucket/code.zip"}' \
  -- python main.py
```
Production Workflow
```mermaid
flowchart TD
    Dev[Developer] -->|Push Code| GitHub
    GitHub --> CI
    CI -->|Build Image| GCR
    GCR -->|Deploy| GKE
    GKE --> RayCluster
```
Submitting Jobs from Local / Different Machines (via Ingress)
Submit from your laptop, another engineer’s machine, or a CI runner—as long as it has the code and can reach http://<INGRESS_IP>:8265.
1. Install Ray CLI
```bash
uv tool install "ray[default]"  # runtime env features require ray[default]
```
2. Submit Options
| Option | Use case |
|---|---|
| Local dir | --working-dir . — uploads current directory |
| Remote GitHub/GCS | --working-dir "https://github.com/user/repo/archive/HEAD.zip" or gs://bucket/code.zip |
| Subdirectory | Add -- uv run --directory src main.py when code lives in a subdir |
| No local upload | --runtime-env-json='{"working_dir": "gs://bucket/code.zip"}' |
Example (local):
```bash
uv lock

ray job submit \
  --address="http://<INGRESS_IP>:8265" \
  --working-dir . \
  -- uv run main.py
```
Example (remote repo + subdirectory):
```bash
ray job submit \
  --address="http://<INGRESS_IP>:8265" \
  --working-dir "https://github.com/user/repo/archive/HEAD.zip" \
  -- uv run --directory src main.py
```
Monitoring & Autoscaling
You should configure:
- Ray autoscaling
- Prometheus + Grafana
- Cloud Monitoring integration
Ray metrics default port: 8080
When Should You Use This Stack?
Use Ray + GKE when you need:
- Distributed training
- Multi-GPU LLM serving
- Batch inference pipelines
- A multi-team ML platform
- CI/CD for ML infra
Avoid it for:
- Small experiments
- Single-node workloads
- Workloads with no need for autoscaling
Final Thoughts
Running Ray on GKE gives you:
- Kubernetes-native autoscaling
- GPU scheduling
- Production-ready LLM serving
- Distributed PyTorch training
- Clean job submission model
This stack scales from experimentation → production seamlessly.
Citations
If you found this useful, please cite this as:
Nguyen, Quan H. (Feb 2026). Guide to deploy Ray Cluster on GKE. https://quanhnguyen232.github.io.
or as a BibTeX entry:
```bibtex
@article{nguyen2026guide-to-deploy-ray-cluster-on-gke,
  title  = {Guide to deploy Ray Cluster on GKE},
  author = {Nguyen, Quan H.},
  year   = {2026},
  month  = {Feb},
  url    = {https://quanhnguyen232.github.io/blog/2026/deploy-ray/}
}
```