Guide to deploy Ray Cluster on GKE

A step-by-step production guide to deploying Ray clusters on GKE for distributed ML and LLM workloads

If you’re building large-scale ML systems - distributed training, batch inference, or LLM serving with vLLM - combining Ray with Kubernetes on GKE gives you a powerful, production-ready stack. See Ray on Kubernetes, Ray Clusters Overview, and Ray on GKE for the official documentation.

This post walks through:

  • An architecture overview
  • Setting up GKE (Standard vs Autopilot)
  • Deploying Ray on Google Kubernetes Engine (GKE)
  • Configuring GPU-enabled Ray clusters
  • Exposing the Ray dashboard securely via Ingress
  • Managing dependencies with uv
  • Submitting distributed jobs from your laptop or CI
  • Preparing your setup for production-grade scaling

Architecture Overview

At a high level:

  • Google Kubernetes Engine (GKE) → Infrastructure & orchestration
  • Ray → Distributed compute engine
  • PyTorch → Model training
  • vLLM → High-performance LLM serving

How It All Fits Together

flowchart TD
    A[User / CI] -->|Submit Job| B(GKE Ingress)
    B --> C[Ray Head Pod]
    C --> D[Ray Workers]
    D --> E[GPU Nodes]
    C --> F[Ray Dashboard 8265]
    D --> G[PyTorch Training]
    D --> H[vLLM Serving]

What Each Layer Does

  • GKE: provisions nodes, autoscaling, and networking
  • KubeRay: manages Ray clusters as CRDs
  • Ray: schedules distributed jobs
  • vLLM: fast LLM inference
  • PyTorch: training & fine-tuning

Prepare Your Environment

export PROJECT_ID=<project_id>
export REGION=us-central1
export ZONE=us-central1-a
export CLUSTER_NAME=ray-cluster
export POOL_NAME=gpu-node-pool
export NAMESPACE=llm

gcloud config set project $PROJECT_ID
gcloud config set billing/quota_project $PROJECT_ID
gcloud services enable container.googleapis.com

Connect to your cluster after creation:

gcloud container clusters get-credentials $CLUSTER_NAME --location=$REGION

Create a GKE Cluster

You have two options. So far I have only worked with Option B.

  1. Option A - Autopilot (Managed Mode)

    Pros:

    • Less infrastructure management
    • Ray operator can be enabled directly
    
     gcloud container clusters create-auto $CLUSTER_NAME \
         --location=$REGION \
         --release-channel=rapid \
         --enable-ray-operator
    

    Autopilot currently has limitations with --enable-ray-operator in some regions. Autopilot example: [Deploy Ray Serve Stable Diffusion on GKE](https://docs.cloud.google.com/kubernetes-engine/docs/add-on/ray-on-gke/tutorials/deploy-ray-serve-stable-diffusion#autopilot). ChatGPT guides: [Deploy RayCluster on GKE](https://chatgpt.com/share/6988d953-7834-800b-a8fd-1387e2bcedc3) · [RayCluster on GKE](https://chatgpt.com/share/6988d9ab-f750-800b-870b-f4b25bf6f281)

  2. Option B - Standard Cluster (More Control)

    Recommended for GPU-heavy ML workloads. See [AI/ML orchestration on GKE](https://cloud.google.com/kubernetes-engine/docs/integrations/ai-infra).

    
     gcloud container clusters create $CLUSTER_NAME \
     --zone=$ZONE \
     --machine-type e2-standard-4 \
     --num-nodes=1 \
     --enable-autoscaling \
     --min-nodes=0 --max-nodes=2
    

Add GPU Node Pool (NVIDIA L4 Example)

gcloud container node-pools create $POOL_NAME \
  --cluster=$CLUSTER_NAME \
  --zone=$ZONE \
  --accelerator type=nvidia-l4,count=1 \
  --machine-type g2-standard-4 \
  --enable-autoscaling \
  --min-nodes=0 --max-nodes=2

Verify GPU and device plugin:

kubectl get nodes -o=custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
kubectl get pods -n kube-system -l k8s-app=nvidia-gpu-device-plugin

Install KubeRay Operator (If Not Using Autopilot)

helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update

helm install kuberay-operator kuberay/kuberay-operator --version 1.5.1

Verify:

kubectl get pods

You should see kuberay-operator running.

Note: If you created the cluster with --enable-ray-operator (Autopilot), skip this step—the Ray operator is already installed.


KubeRay kubectl-ray Plugin (Autopilot Only)

For Autopilot clusters with the GKE Ray add-on, you may need the KubeRay kubectl plugin:

# Check your KubeRay version (from CRD annotations)
kubectl get crd rayclusters.ray.io -o jsonpath='{.metadata.annotations}' ; echo

# Install kubectl-ray (replace v1.4.2 with your version)
curl -LO https://github.com/ray-project/kuberay/releases/download/v1.4.2/kubectl-ray_v1.4.2_linux_amd64.tar.gz
tar -xvf kubectl-ray_v1.4.2_linux_amd64.tar.gz
cp kubectl-ray ~/.local/bin

kubectl ray version

Deploy a GPU-Enabled RayCluster

kubectl apply -f raycluster-gpu.yaml
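
The raycluster-gpu.yaml manifest isn't shown in this post; a minimal sketch is below. Assumptions to adapt: the KubeRay v1 API, the rayproject/ray:2.53.0-gpu image mentioned later, and resource sizes that fit the g2-standard-4 GPU pool created above.

```yaml
# raycluster-gpu.yaml (sketch): one head pod plus autoscalable L4 workers.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-gpu
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"   # expose the dashboard beyond localhost
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.53.0-gpu
            resources:
              requests:
                cpu: "2"
                memory: 4Gi
  workerGroupSpecs:
    - groupName: gpu-workers
      replicas: 1
      minReplicas: 0
      maxReplicas: 2
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.53.0-gpu
              resources:
                limits:
                  nvidia.com/gpu: "1"   # schedules workers onto the GPU pool
```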

Check status:

kubectl get rayclusters
kubectl get pods --selector=ray.io/cluster=raycluster-gpu

RayCluster Internal Structure

flowchart TB
    subgraph GKE Cluster
        H[Ray Head Pod]
        W1[Worker Pod 1]
        W2[Worker Pod 2]
    end

    H --> W1
    H --> W2
    W1 --> GPU1[NVIDIA GPU]
    W2 --> GPU2[NVIDIA GPU]

Access Ray Head Pod

export HEAD_POD=$(kubectl get pods \
  --selector=ray.io/node-type=head \
  -o custom-columns=POD:metadata.name --no-headers)

kubectl exec -it $HEAD_POD -- bash

Expose Ray Dashboard (Port 8265) via GKE Ingress

GKE supports gce (external) and gce-internal ingress modes. For gce-internal, you must create a Proxy-Only subnet.

# Get Ray head service name, then update ray-dashboard-ingress.yaml
kubectl get svc

kubectl apply -f ray-dashboard-ingress.yaml
kubectl get ingress
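
The ray-dashboard-ingress.yaml referenced above could look like this minimal sketch. Assumptions: an external gce ingress class, and a head service name following KubeRay's `<cluster>-head-svc` convention; replace it with whatever `kubectl get svc` actually reports.

```yaml
# ray-dashboard-ingress.yaml (sketch): route external traffic to the
# Ray dashboard port on the head service.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ray-dashboard-ingress
  annotations:
    kubernetes.io/ingress.class: "gce"   # use "gce-internal" for internal LB
spec:
  defaultBackend:
    service:
      name: raycluster-gpu-head-svc      # assumed KubeRay naming convention
      port:
        number: 8265
```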

After a few minutes, GKE assigns an external IP.

Visit:

http://<EXTERNAL_IP>


Networking Flow

flowchart LR
    User --> Ingress
    Ingress --> RayHeadService
    RayHeadService --> RayHeadPod
    RayHeadPod --> RayWorkers

Dependency Management with uv

Ray supports runtime environments. See the Ray docs on Environment Dependencies.

Important: Your local Python version must match the Python version inside the Ray image. For example, rayproject/ray:2.53.0-gpu uses Python 3.10.19.
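
You can enforce that note before submitting anything. A sketch, where `matches_image_python` is a hypothetical helper and the comparison is at the major.minor level (patch-level differences are generally tolerated):

```python
import sys

def matches_image_python(image_python: str) -> bool:
    """Return True if the local interpreter matches the Ray image's
    Python version at the major.minor level."""
    local = (sys.version_info.major, sys.version_info.minor)
    image = tuple(int(p) for p in image_python.split(".")[:2])
    return local == image

# Example: rayproject/ray:2.53.0-gpu ships Python 3.10.x
if not matches_image_python("3.10.19"):
    print("warning: local Python does not match the Ray image")
```

Running this as a pre-submit check fails fast instead of surfacing as a confusing runtime_env error on the cluster.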


Use via ray.init() (Inside Ray Pods)

When running code directly inside the head/worker pod, use runtime_env with uv:

uv export --format requirements.txt -o requirements.txt
# runtime_env expects requirements.txt, not pyproject.toml
ray.init(runtime_env={"uv": "./path/requirements.txt"})

Pattern A (Best for Iteration): ray job submit … -- uv run …

Keep a repo locally (or on CI) with pyproject.toml, uv.lock, and your scripts. Submit from any machine that can reach the Ingress:

uv lock

ray job submit \
  --address="http://<INGRESS_IP_OR_DNS>:8265" \
  --no-wait \
  --working-dir . \
  -- uv run main.py

Ray uploads your working directory and installs dependencies. Use --no-wait for fire-and-forget, then:

ray job logs <job-id> --address="http://<INGRESS_IP_OR_DNS>:8265"
ray job status <job-id> --address="http://<INGRESS_IP_OR_DNS>:8265"
ray job stop <job-id> --address="http://<INGRESS_IP_OR_DNS>:8265"

Tip: Set export RAY_ADDRESS="http://<INGRESS_IP_OR_DNS>:8265" to avoid passing --address every time.


Remote working_dir (Avoid Local Upload)

Instead of --working-dir ., use a remote URI so Ray fetches code from GitHub or GCS:

  • Public GitHub: https://github.com/user/repo/archive/HEAD.zip
  • Private GitHub: https://user:TOKEN@github.com/user/repo/archive/HEAD.zip
  • GCS: gs://bucket/code.zip

Example with a subdirectory (e.g. src/ in the repo):

ray job submit \
  --address="http://<INGRESS_IP_OR_DNS>:8265" \
  --working-dir "https://github.com/user/repo/archive/HEAD.zip" \
  -- uv run --directory src main_src.py
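
For CI scripts, assembling these flags programmatically avoids quoting mistakes. A hypothetical helper (the function name, address, and bucket path are all placeholders):

```python
import json

def submit_command(address: str, working_dir: str, entrypoint: str) -> list[str]:
    """Build the argv list for `ray job submit` with a remote working_dir
    (GitHub archive URL or gs:// zip)."""
    return [
        "ray", "job", "submit",
        f"--address={address}",
        "--runtime-env-json", json.dumps({"working_dir": working_dir}),
        "--", *entrypoint.split(),   # everything after -- is the entrypoint
    ]

cmd = submit_command(
    "http://10.0.0.5:8265",          # placeholder Ingress address
    "gs://bucket/code.zip",          # placeholder bucket path
    "uv run --directory src main.py",
)
print(" ".join(cmd))
```

Pass the list to subprocess.run(cmd) so no shell quoting is involved at all.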

Pattern B (Best for Production)

Bake dependencies into the image to avoid per-job installs:

FROM rayproject/ray-ml:2.x-gpu
# uv is assumed to be available in the base image; otherwise install it
# first, e.g. RUN pip install uv
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen

Use this image in your RayCluster for both head and workers. Then job submission can ship only code or parameters.
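
In the RayCluster manifest, both groups then reference the baked image. A fragment sketch (the Artifact Registry path is a placeholder):

```yaml
# Fragment: point head and worker groups at the pre-baked image.
spec:
  headGroupSpec:
    template:
      spec:
        containers:
          - name: ray-head
            image: us-docker.pkg.dev/PROJECT_ID/ray/ray-uv:latest
  workerGroupSpecs:
    - groupName: gpu-workers
      template:
        spec:
          containers:
            - name: ray-worker
              image: us-docker.pkg.dev/PROJECT_ID/ray/ray-uv:latest
```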


Pattern C: Remote Code Only (No Local Upload)

If you want to avoid uploading from your machine entirely:

  1. Zip your repo (single top-level directory) and upload to GCS.
  2. Submit with --runtime-env-json and working_dir: "gs://bucket/code.zip":
ray job submit \
  --address="http://<INGRESS_IP_OR_DNS>:8265" \
  --runtime-env-json='{"working_dir": "gs://bucket/code.zip"}' \
  -- python main.py
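
Step 1 above (a zip with a single top-level directory) can be scripted with the standard library. A sketch; the paths are placeholders, and the actual upload would still go through gsutil or the GCS client:

```python
import os
import zipfile

def zip_repo(src_dir: str, out_zip: str, top_level: str = "code") -> None:
    """Zip a repo under a single top-level directory ("code/"), the
    layout Ray expects for remote working_dir archives."""
    out_abs = os.path.abspath(out_zip)
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                path = os.path.join(root, name)
                if os.path.abspath(path) == out_abs:
                    continue  # don't zip the archive into itself
                rel = os.path.relpath(path, src_dir)
                zf.write(path, os.path.join(top_level, rel))

# zip_repo(".", "code.zip")
# then: gsutil cp code.zip gs://bucket/code.zip
```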

Production Workflow

flowchart TD
    Dev[Developer] -->|Push Code| GitHub
    GitHub --> CI
    CI -->|Build Image| GCR
    GCR -->|Deploy| GKE
    GKE --> RayCluster

Submitting Jobs from Local / Different Machines (via Ingress)

Submit from your laptop, another engineer’s machine, or a CI runner—as long as it has the code and can reach http://<INGRESS_IP>:8265.

1. Install Ray CLI

uv tool install "ray[default]"   # runtime env feature requires ray[default]

2. Submit Options

  • Local dir: --working-dir . uploads the current directory
  • Remote GitHub/GCS: --working-dir "https://github.com/user/repo/archive/HEAD.zip" or gs://bucket/code.zip
  • Subdirectory: add -- uv run --directory src main.py when code lives in a subdir
  • No local upload: --runtime-env-json='{"working_dir": "gs://bucket/code.zip"}'

Example (local):

uv lock
ray job submit \
  --address="http://<INGRESS_IP>:8265" \
  --working-dir . \
  -- uv run main.py

Example (remote repo + subdirectory):

ray job submit \
  --address="http://<INGRESS_IP>:8265" \
  --working-dir "https://github.com/user/repo/archive/HEAD.zip" \
  -- uv run --directory src main.py

Monitoring & Autoscaling

You should configure:

  • Ray autoscaling
  • Prometheus + Grafana
  • Cloud Monitoring integration

Ray metrics default port: 8080
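
If you use Google Managed Service for Prometheus, a PodMonitoring resource can scrape that port. A sketch, assuming the standard KubeRay pod label ray.io/is-ray-node:

```yaml
# Sketch: scrape Ray's metrics port (8080) on all Ray pods via
# Google Managed Service for Prometheus.
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: ray-metrics
spec:
  selector:
    matchLabels:
      ray.io/is-ray-node: "yes"   # label applied by KubeRay to Ray pods
  endpoints:
    - port: 8080
      interval: 30s
```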


When Should You Use This Stack?

Use Ray + GKE when you need:

  • Distributed training
  • Multi-GPU LLM serving
  • Batch inference pipelines
  • Multi-team ML platform
  • CI/CD for ML infra

Avoid it for:

  • Small experiments
  • Single-node workloads
  • No need for autoscaling

Final Thoughts

Running Ray on GKE gives you:

  • Kubernetes-native autoscaling
  • GPU scheduling
  • Production-ready LLM serving
  • Distributed PyTorch training
  • Clean job submission model

This stack scales from experimentation → production seamlessly.




If you found this useful, please cite this as:

Nguyen, Quan H. (Feb 2026). Guide to deploy Ray Cluster on GKE. https://quanhnguyen232.github.io.

or as a BibTeX entry:

@article{nguyen2026guide-to-deploy-ray-cluster-on-gke,
  title   = {Guide to deploy Ray Cluster on GKE},
  author  = {Nguyen, Quan H.},
  year    = {2026},
  month   = {Feb},
  url     = {https://quanhnguyen232.github.io/blog/2026/deploy-ray/}
}
