Building a GPU cloud platform
AI cloud providers face a specific problem: enterprise AI teams expect dedicated Kubernetes clusters with full isolation, GPU Operator support, and the same operational behavior they get from hyperscaler Kubernetes services. Building and operating a real cluster per customer is expensive and operationally unsustainable. Sharing a single cluster across tenants introduces noisy neighbors and hard-to-enforce boundaries.
vCluster solves this by virtualizing the Kubernetes control plane. Every tenant gets a dedicated API server, their own CRDs, RBAC, and a complete cluster experience. From the tenant's perspective it is indistinguishable from a real cluster. For the provider, each tenant cluster provisions in seconds and is managed centrally by vCluster Platform, without the overhead of running a physical cluster per customer.
How it fits into your platform
A vCluster deployment has two layers:
- A Control Plane Cluster that hosts the virtualized control planes for your tenant clusters. It is operated by the platform provider and is completely invisible to tenants: there are no shared control plane nodes, no in-cluster agent pods, and no lateral path between tenant environments. This can be an existing Kubernetes cluster, a purpose-built cluster, or a vCluster Standalone instance on your own infrastructure. (With shared nodes, this cluster also runs tenant workloads alongside the control plane pods.)
- Tenant clusters that run on top of that infrastructure. Each tenant cluster has its own API server, controller manager, data store, and lifecycle. From the tenant's perspective it behaves exactly like a standard Kubernetes cluster.
vCluster Platform sits above both layers. It handles provisioning, access control, lifecycle management, and node automation across all your tenant clusters and Control Plane Clusters.
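Each tenant control plane is configured through a `vcluster.yaml`. As a minimal sketch (field names follow the vCluster v0.20+ configuration schema; the values shown are illustrative, not recommendations):

```yaml
# vcluster.yaml -- minimal tenant control plane sketch.
# Embedded etcd keeps the tenant's data store inside its own
# control plane, with no shared backing store between tenants.
controlPlane:
  backingStore:
    etcd:
      embedded:
        enabled: true
```

Per-tenant settings like this are what vCluster Platform templates and applies when it provisions a tenant cluster.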
Worker nodes for GPU workloads
The worker node model determines how isolated each tenant's compute is. For AI cloud workloads, the choice typically comes down to whether tenants need fully dedicated infrastructure or whether shared infrastructure with per-tenant node pools is sufficient.
Private nodes
Each tenant cluster runs on its own dedicated physical nodes, with a separate CNI, CSI, and control plane. No compute, network, or storage is shared. Tenants get full NCCL bandwidth and hardware-level separation.
Private nodes are the right choice for:
- Multi-node distributed training and large model inference
- Customers requiring compliance or regulatory isolation
- Workloads where performance predictability is non-negotiable
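In configuration terms, a private-nodes tenant opts out of sharing entirely. A sketch, assuming the `privateNodes` feature available in recent vCluster releases (verify the exact field names against your version's `vcluster.yaml` reference):

```yaml
# vcluster.yaml -- private nodes sketch (field names assumed
# from recent vCluster releases; check your version).
# Worker nodes join this tenant's control plane directly,
# so CNI, CSI, and compute are dedicated to the tenant.
privateNodes:
  enabled: true
controlPlane:
  backingStore:
    etcd:
      embedded:
        enabled: true
```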
Shared infrastructure with dedicated node pools
Each tenant cluster gets exclusive access to a labeled node pool on the Control Plane Cluster. Compute is scoped per tenant without provisioning separate infrastructure. CNI, CSI, and platform services are shared.
This is the right choice for:
- Per-tenant GPU node pools with elastic scaling
- Customers who need compute separation but not full infrastructure isolation
- Environments where shared platform services are acceptable
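Scoping a tenant to its labeled pool can be expressed in `vcluster.yaml` via node syncing with a label selector. The label key and value below are illustrative placeholders, not a convention the platform requires:

```yaml
# vcluster.yaml -- shared infrastructure sketch.
# The tenant's API server only sees host nodes matching the
# selector, so its node view is limited to its own pool.
sync:
  fromHost:
    nodes:
      enabled: true
      selector:
        labels:
          tenant.example.com/pool: tenant-a  # illustrative label
```

Combined with standard Kubernetes taints and tolerations on the pool, this keeps tenant workloads on their assigned GPU nodes without provisioning separate infrastructure.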
Dedicated Nodes in the architecture overview →
Add Virtual Nodes for stronger isolation
Virtual Nodes add an isolation boundary at the node level using vNode, which enforces per-tenant scheduling constraints and gives each tenant its own view of the node environment.
Virtual Nodes are not a deployment model. They are an optional isolation layer that applies to either worker node choice above — private or shared. This makes them valuable when strict per-tenant workload separation is required regardless of how the underlying nodes are deployed.
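If vNode is installed as a runtime layer on the worker nodes, tenant workloads typically opt into it through a Kubernetes RuntimeClass. The class name `vnode` below is an assumption for illustration; use whatever name your vNode installation registers:

```yaml
# Sketch: a tenant pod opting into the vNode isolation layer.
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  runtimeClassName: vnode  # assumed RuntimeClass name; check your install
  containers:
    - name: trainer
      image: example.com/trainer:latest  # illustrative image
      resources:
        limits:
          nvidia.com/gpu: 1  # standard GPU resource name from the device plugin
```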
Bootstrap with vCluster Standalone
Every vCluster deployment needs a Control Plane Cluster. For AI cloud providers standing up infrastructure from scratch, vCluster Standalone is a natural choice.
vCluster Standalone is a complete, self-contained Kubernetes distribution that runs directly on bare metal or VMs, with no dependency on any other Kubernetes distribution. Once deployed, it behaves like any Kubernetes cluster: you install vCluster Platform on top of it, deploy tenant clusters, and join bare metal worker nodes, all using vCluster tooling.
This solves the "cluster one" problem: you can bootstrap your entire platform stack from bare metal without adopting another Kubernetes distribution.
Deploy vCluster Standalone →
Next steps
- Architecture — understand control plane deployment options, worker node models, and the syncer
- Private Nodes — deploy dedicated GPU infrastructure per tenant
- vCluster Standalone — bootstrap your Control Plane Cluster on bare metal
- vCluster Platform — centralized management, access control, and node automation
- vNode — workload isolation layer for GPU tenants