Version: v0.34 Stable

AI Cloud: Managed Kubernetes Service

Run a managed Kubernetes service on your GPU infrastructure. Each customer gets an isolated tenant cluster with dedicated GPU nodes. Your product is what customers interact with. Platform is the operations layer your team runs behind it.

Typical stack: Standalone (HA) as the Control Plane Cluster. Private nodes per customer cluster. vMetal for bare metal GPU lifecycle. vNode for workload runtime isolation.

Figure: Enterprise AI platform architecture, with a central control plane that creates and manages tenant clusters for customers.

What makes this path different: Customers never touch Platform. They interact only with your product. Platform RBAC restricts direct access to your platform engineering team.

Day 0: Design decisions

| Decision | Read next | Outcome |
| --- | --- | --- |
| Choose the control plane deployment model | Standalone deployment, Architecture | Control planes as pods on an existing Kubernetes cluster, or Standalone on dedicated CPU nodes. Standalone is the common choice when no prior Kubernetes substrate exists. |
| Plan bare metal GPU provisioning | vMetal docs, Metal3 node provider, bare metal overview | Decide whether vMetal manages the full machine lifecycle (PXE, OS imaging, BMC, reclaim) or nodes are joined manually or via another provisioner. |
| Define per-customer node isolation | Private Nodes, node requirements | Each customer's tenant cluster gets its own dedicated GPU node pool with a separate CNI/CSI, eliminating interference between customers. |
| Plan network isolation | VPN, Netris integration | Tenant clusters connect to their private nodes over an encrypted VPN tunnel. Netris integration adds switch-level VLAN/VXLAN isolation per tenant. |
| Choose runtime isolation model | vNode docs, Virtual Nodes | vNode provides kernel-level container isolation without VM overhead. Recommended when customers run privileged workloads, dynamic code execution, or need GPU access via CDI. |
| Define cluster templates and AI stacks | Templates, Certified Stacks | Each customer cluster template includes GPU Operator, a scheduler (Run.ai, Kueue, Volcano), and optionally a developer environment. Certified Stacks provide pre-validated configurations. |
| Plan the customer-facing provisioning API | Projects, Quotas, Platform API | Your product API calls Platform to provision tenant clusters. Define the project structure, quota model, and automation hooks that back your customer-facing workflows. |
| Plan durability | Backing store, container control plane HA, Standalone HA, Platform HA | Choose the data store and replica model for Platform and per-customer control planes. |
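
When planning the quota model behind your customer-facing provisioning API, it can help to write the admission rule down before mapping it to Platform projects and quotas. The following is a minimal sketch only: `ProjectQuota`, `CustomerProject`, and their fields are hypothetical names invented for illustration, not Platform API objects, and the actual provisioning call is left as a comment.

```python
from dataclasses import dataclass, field


@dataclass
class ProjectQuota:
    """Hypothetical per-customer limits; field names are illustrative only."""
    max_clusters: int
    max_gpu_nodes: int


@dataclass
class CustomerProject:
    """Hypothetical in-memory model of one customer's project."""
    name: str
    quota: ProjectQuota
    # Maps tenant cluster name -> number of dedicated GPU nodes.
    clusters: dict = field(default_factory=dict)

    def gpu_nodes_in_use(self) -> int:
        return sum(self.clusters.values())

    def can_provision(self, gpu_nodes: int) -> bool:
        # A new cluster must fit under both the cluster and GPU node limits.
        within_clusters = len(self.clusters) + 1 <= self.quota.max_clusters
        within_gpus = self.gpu_nodes_in_use() + gpu_nodes <= self.quota.max_gpu_nodes
        return within_clusters and within_gpus

    def provision(self, cluster_name: str, gpu_nodes: int) -> None:
        if cluster_name in self.clusters:
            raise ValueError(f"cluster {cluster_name!r} already exists")
        if not self.can_provision(gpu_nodes):
            raise ValueError(f"quota exceeded for project {self.name!r}")
        # Assumption: your product API would call Platform here to create
        # the tenant cluster. That call is intentionally not shown.
        self.clusters[cluster_name] = gpu_nodes


project = CustomerProject("customer-a", ProjectQuota(max_clusters=2, max_gpu_nodes=8))
project.provision("training", gpu_nodes=6)
print(project.can_provision(gpu_nodes=2))  # True: 8 of 8 GPU nodes, 2 of 2 clusters
print(project.can_provision(gpu_nodes=4))  # False: would need 10 GPU nodes
```

Checking the quota in your own product layer before calling Platform is one possible design; enforcing it only through Platform quotas is another, and the two can be combined.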

Day 1: Stand up the first production customer cluster

Note: Steps 3 and 4 configure Platform for your platform engineering team, not for your customers. Customers provision clusters through your product. Platform access should be restricted to your ops team.

  1. Install vCluster Platform. If building from bare metal, deploy vCluster Standalone first, then move to Standalone HA before serving production traffic.
  2. Configure backing store and Platform HA.
  3. Configure SSO and permissions for your platform engineering team.
  4. Create projects, templates, quotas, and Auto Nodes to back your customer provisioning workflows.
  5. Set up vMetal and the Metal3 node provider: register BMC credentials, configure PXE networking, define OS images, and verify that bare metal hosts reach the available state.
  6. Configure per-customer network isolation with VPN and, if using Netris, the Netris integration.
  7. Install vNode on eligible GPU nodes. Configure sync.toHost.pods.runtimeClassName: vnode in the cluster template.
  8. Deploy the first customer template using Certified Stacks as the starting point for GPU Operator, scheduler, and AI tooling.
  9. Validate tenant isolation from inside the tenant cluster: confirm the customer cannot see the Control Plane Cluster, other tenants, or platform internals.
  10. Wire your product API to Platform's provisioning endpoints and test the end-to-end customer onboarding flow.
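
The dotted setting in step 7, sync.toHost.pods.runtimeClassName: vnode, is shorthand for a nested key. Assuming the cluster template takes nested YAML values, the short sketch below expands the dotted path and prints the nested form. The helper names `expand_dotted` and `to_yaml` are made up for this example, and the renderer only handles nested dictionaries of plain scalar values.

```python
def expand_dotted(path, value):
    """Turn 'a.b.c' plus a value into {'a': {'b': {'c': value}}}."""
    nested = value
    for key in reversed(path.split(".")):
        nested = {key: nested}
    return nested


def to_yaml(node, indent=0):
    """Render a nested dict of scalars as indented YAML lines (stdlib only)."""
    lines = []
    pad = "  " * indent
    for key, val in node.items():
        if isinstance(val, dict):
            lines.append(f"{pad}{key}:")
            lines.extend(to_yaml(val, indent + 1))
        else:
            lines.append(f"{pad}{key}: {val}")
    return lines


config = expand_dotted("sync.toHost.pods.runtimeClassName", "vnode")
rendered = "\n".join(to_yaml(config))
print(rendered)
# sync:
#   toHost:
#     pods:
#       runtimeClassName: vnode
```

Where exactly this fragment sits inside a template depends on your template layout, so check it against the Templates and vNode docs before applying it.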

Day 2: Operate

| Operation | Read next |
| --- | --- |
| Manage bare metal capacity and machine lifecycle | Bare metal overview, Metal3 node provider, vMetal docs |
| Monitor platform and tenant workloads | Monitoring overview, fleet monitoring |
| Upgrade Platform and tenant clusters | Upgrade vCluster, upgrade Platform |
| Back up and restore tenant clusters and Platform | Snapshots, restore, Platform backup |
| Manage vNode compatibility during upgrades | vNode limitations, vNode configuration |
| Scale the Control Plane Cluster | Platform HA, multi-region Platform |