Version: v0.36 Stable

AI Cloud: Managed Kubernetes Service

Run a managed Kubernetes service on your GPU infrastructure. Each customer gets an isolated tenant cluster with dedicated GPU nodes. Your product is what customers interact with. Platform is the operations layer your team runs behind it.

Typical stack: Standalone (HA) as the Control Plane Cluster. Private nodes per customer cluster. vMetal for bare metal GPU lifecycle. vNode for workload runtime isolation.

Figure: Enterprise AI platform architecture. A central control plane creates and manages tenant clusters for customers.

What makes this path different: Customers never touch Platform; they interact only with your product. Platform RBAC restricts direct Platform access to your platform engineering team. This architecture also maps to the cluster-level isolation criteria that AI cloud buyers evaluate in frameworks like ClusterMAX.

Day 0: Design decisions

Decision	Read next	Outcome
Choose the control plane deployment model	Standalone deployment, Architecture	Control planes as pods on an existing Kubernetes cluster, or Standalone on dedicated CPU nodes. Standalone is the common choice when no prior Kubernetes substrate exists.
Plan bare metal GPU provisioning	vMetal docs, Metal3 node provider, bare metal overview	Decide whether vMetal manages the full machine lifecycle (PXE, OS imaging, BMC, reclaim) or nodes are joined manually or via another provisioner.
Define per-customer node isolation	Private Nodes, node requirements	Each customer's tenant cluster gets its own dedicated GPU node pool with a separate CNI/CSI, eliminating interference between customers.
Plan network isolation	VPN, Netris integration	Tenant clusters connect to their private nodes over an encrypted VPN tunnel. Netris integration adds switch-level VLAN/VXLAN isolation per tenant.
Choose runtime isolation model	vNode docs, Virtual Nodes	vNode provides kernel-level container isolation without VM overhead. Recommended when customers run privileged workloads, dynamic code execution, or need GPU access via CDI.
Define cluster templates and AI stacks	Templates, Certified Stacks	Each customer cluster template includes GPU Operator, a scheduler (Run:ai, Kueue, or Volcano), and optionally a developer environment. Certified Stacks provide pre-validated configurations.
Plan the customer-facing provisioning API	Projects, Quotas, Platform API	Your product API calls Platform to provision tenant clusters. Define the project structure, quota model, and automation hooks that back your customer-facing workflows.
Plan durability	Backing store, container control plane HA, Standalone HA, Platform HA	Choose the data store and replica model for Platform and per-customer control planes.
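
The quota model behind the customer-facing provisioning API can be prototyped before any Platform calls are wired in. The following is a minimal sketch, assuming a hypothetical per-project quota on cluster count and GPU count. `ProjectQuota`, `ClusterRequest`, and `admit` are illustrative names and are not part of the Platform API; a real integration would enforce quotas through Platform projects instead.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ProjectQuota:
    """Hypothetical per-customer quota; field names are illustrative only."""
    max_clusters: int
    max_gpus: int
    clusters_in_use: int = 0
    gpus_in_use: int = 0


@dataclass(frozen=True)
class ClusterRequest:
    """Hypothetical request your product API receives from a customer."""
    customer: str
    gpus: int


def admit(quota: ProjectQuota, request: ClusterRequest) -> Tuple[bool, str]:
    """Decide whether a request fits the quota; reserve capacity if it does."""
    if request.gpus <= 0:
        return False, "request must ask for at least one GPU"
    if quota.clusters_in_use + 1 > quota.max_clusters:
        return False, "cluster quota exceeded"
    if quota.gpus_in_use + request.gpus > quota.max_gpus:
        return False, "GPU quota exceeded"
    # Reserve capacity only after every check has passed.
    quota.clusters_in_use += 1
    quota.gpus_in_use += request.gpus
    return True, "admitted"
```

Running the check in your product API before calling Platform lets a rejected request fail fast with a customer-readable reason, while Platform-side quotas remain the authoritative limit.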

Day 1: Stand up the first production customer cluster

Note: Steps 3 and 4 configure Platform for your platform engineering team, not for your customers. Customers provision clusters through your product. Platform access should be restricted to your ops team.

1. Install vCluster Platform. If building from bare metal, deploy vCluster Standalone first, then move to Standalone HA before serving production traffic.
2. Configure the backing store and Platform HA.
3. Configure SSO and permissions for your platform engineering team.
4. Create projects, templates, quotas, and Auto Nodes to back your customer provisioning workflows.
5. Set up vMetal and the Metal3 node provider: register BMC credentials, configure PXE networking, define OS images, and verify that bare metal hosts reach the available state.
6. Configure per-customer network isolation with VPN and, if using Netris, the Netris integration.
7. Install vNode on eligible GPU nodes. Set sync.toHost.pods.runtimeClassName: vnode in the cluster template.
8. Deploy the first customer template using Certified Stacks as the starting point for GPU Operator, scheduler, and AI tooling.
9. Validate tenant isolation from inside the tenant cluster: confirm the customer cannot see the Control Plane Cluster, other tenants, or platform internals.
10. Wire your product API to Platform's provisioning endpoints and test the end-to-end customer onboarding flow.
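
The dotted key for the vNode runtime class in the steps above is shorthand for a nested configuration path. The following is a minimal sketch of how that key expands, assuming the cluster template accepts the setting as nested YAML. `expand_dotted` and `to_yaml` are hypothetical helpers written for illustration, not vCluster tooling, and `to_yaml` handles only nested mappings of plain scalars.

```python
from typing import Any, Dict


def expand_dotted(path: str, value: Any) -> Dict[str, Any]:
    """Expand a dotted key such as 'a.b.c' into nested dictionaries."""
    keys = path.split(".")
    root: Dict[str, Any] = {}
    node = root
    for key in keys[:-1]:
        # Create each intermediate level and descend into it.
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return root


def to_yaml(mapping: Dict[str, Any], indent: int = 0) -> str:
    """Render nested dicts of scalars as block-style YAML (no lists, no quoting)."""
    lines = []
    pad = "  " * indent
    for key, value in mapping.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.append(to_yaml(value, indent + 1))
        else:
            lines.append(f"{pad}{key}: {value}")
    return "\n".join(lines)


config = expand_dotted("sync.toHost.pods.runtimeClassName", "vnode")
print(to_yaml(config))
# sync:
#   toHost:
#     pods:
#       runtimeClassName: vnode
```

Where exactly this fragment sits inside a cluster template depends on the template format, so check the Templates and vNode configuration docs before copying it.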

Day 2: Operate

Operation	Read next
Manage bare metal capacity and machine lifecycle	Bare metal overview, Metal3 node provider, vMetal docs
Monitor platform and tenant workloads	Monitoring overview, fleet monitoring
Upgrade Platform and tenant clusters	Upgrade vCluster, upgrade Platform
Back up and restore tenant clusters and Platform	Snapshots, restore, Platform backup
Manage vNode compatibility during upgrades	vNode limitations, vNode configuration
Scale the Control Plane Cluster	Platform HA, multi-region Platform
