Version: main 🚧

Enterprise AI Factory

Give every team in your organization a production-grade AI environment without the overhead of managing separate clusters. Production ML teams get private GPU nodes and full isolation. Dev and experiment teams share a node pool and provision on demand. Platform keeps everything governed, audited, and within budget.

Typical stack: vCluster Platform on an existing cloud Kubernetes cluster, or vCluster Standalone on-premises. Private nodes for production ML teams. Shared node pool for dev and experiment workloads. vMetal optional for on-premises GPU fleets.

Figure: Enterprise AI factory architecture. A central control plane manages private GPU nodes for production ML teams alongside a shared node pool for dev and experiment workloads.

What makes this path different: Your teams use Platform directly, or through an internal portal built on top of it. Self-service provisioning, standardized GPU environments, and chargeback visibility are what you are building toward.

Day 0: Design decisions

| Decision | Read next | Outcome |
| --- | --- | --- |
| Decide the Control Plane Cluster foundation | Architecture, Deployment basics | Run the Control Plane Cluster on an existing managed Kubernetes service (EKS, AKS, GKE) or vCluster Standalone on on-premises servers. |
| Define team tenancy tiers | Private Nodes, Deployment basics | Classify teams: production AI workloads on private GPU nodes, dev and experiment on shared nodes. |
| Define project and quota structure | Projects, Quotas | Map projects to teams or business units. Set GPU, CPU, and memory quotas per project. |
| Plan GPU tooling and AI stacks | Certified Stacks | Standardize GPU Operator, scheduler (Run.ai, Kueue, Volcano), and developer environment (Jupyter, VS Code) across teams using Certified Stacks. |
| Choose the runtime isolation model | vNode docs | vNode is recommended when teams run agentic workloads, untrusted code execution, or need root access inside containers without the risk of node escape. |
| Plan SSO and governance | SSO, RBAC | Integrate with your corporate identity provider. Define which teams can self-service create clusters and which require approval. |
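
One way to sanity-check the tenancy and quota decisions before creating anything is to write the plan down as data and compare it with the GPU capacity of each tier. The sketch below is plain Python, not a Platform API: the `ProjectQuota` fields, the tier names, the project names, and the capacity numbers are all hypothetical and only illustrate the check.

```python
from dataclasses import dataclass

# Hypothetical tier names mirroring the two tiers described above.
PRIVATE = "private"  # production ML teams on dedicated GPU nodes
SHARED = "shared"    # dev and experiment teams on the shared node pool


@dataclass(frozen=True)
class ProjectQuota:
    """Illustrative per-project quota; field names are assumptions."""
    project: str
    tier: str
    gpus: int
    cpu_cores: int
    memory_gib: int


def validate_plan(quotas, gpu_capacity):
    """Return a list of problems found in a draft quota plan.

    gpu_capacity maps a tier name to the number of GPUs available to it.
    """
    problems = []
    allocated = {PRIVATE: 0, SHARED: 0}
    seen = set()
    for q in quotas:
        if q.tier not in allocated:
            problems.append(f"{q.project}: unknown tier {q.tier!r}")
            continue
        if q.project in seen:
            problems.append(f"{q.project}: defined more than once")
        seen.add(q.project)
        if min(q.gpus, q.cpu_cores, q.memory_gib) < 0:
            problems.append(f"{q.project}: negative quota value")
        allocated[q.tier] += q.gpus
    # Compare the summed GPU quotas of each tier with that tier's capacity.
    for tier, total in allocated.items():
        capacity = gpu_capacity.get(tier, 0)
        if total > capacity:
            problems.append(
                f"{tier}: {total} GPUs allocated but only {capacity} available"
            )
    return problems


# Sample plan with made-up projects and numbers.
plan = [
    ProjectQuota("fraud-detection", PRIVATE, gpus=16, cpu_cores=256, memory_gib=2048),
    ProjectQuota("research-sandbox", SHARED, gpus=4, cpu_cores=64, memory_gib=512),
    ProjectQuota("data-science-dev", SHARED, gpus=6, cpu_cores=96, memory_gib=768),
]

for problem in validate_plan(plan, {PRIVATE: 16, SHARED: 8}):
    print(problem)
```

With the sample numbers, the script reports that the shared tier allocates 10 GPUs against 8 available. Whether that counts as an error depends on your policy: some organizations deliberately oversubscribe a shared pool and rely on the scheduler to arbitrate.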

Day 1: Stand up the first production environment

  1. Install vCluster Platform on your Control Plane Cluster.
  2. Configure SSO, teams, and permissions against your corporate identity provider.
  3. Create projects, templates, and quotas per team or business unit.
  4. Configure the shared node pool: pod security standard, network policy, and resource quota.
  5. Configure private GPU nodes for production teams: join nodes manually or configure Auto Nodes.
  6. Set up vMetal if on-premises GPU servers are part of the fleet.
  7. Deploy Certified Stacks as the baseline template for GPU-enabled tenant clusters.
  8. Install vNode on nodes designated for agentic or privileged workloads.
  9. Provision the first tenant cluster for each tier (shared and private) and validate GPU access, isolation, and quota enforcement.
  10. Document how teams request clusters, retrieve kubeconfigs, request quota increases, and report issues.
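
Steps 4 and 9 above can be sketched with standard Kubernetes objects. The example below builds a namespace labeled for the Pod Security Standards, a default-deny ingress NetworkPolicy, a ResourceQuota that caps GPU, CPU, and memory requests, and a one-shot pod that requests a GPU and runs `nvidia-smi`. It assumes the NVIDIA device plugin exposes GPUs as `nvidia.com/gpu` and that the chosen image makes `nvidia-smi` available. The namespace name, quota values, and image reference are placeholders, and Platform templates may manage equivalent settings for you, so treat this as an illustration rather than the documented procedure.

```python
import json


def shared_pool_manifests(namespace, gpu_limit, cpu_limit, memory_limit):
    """Build baseline objects for one namespace on the shared node pool."""
    ns = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": namespace,
            # Pod Security Admission label; "baseline" is an example level.
            "labels": {"pod-security.kubernetes.io/enforce": "baseline"},
        },
    }
    # An empty podSelector selects every pod, so all ingress is denied
    # until a more specific policy allows it.
    default_deny = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        "spec": {"podSelector": {}, "policyTypes": ["Ingress"]},
    }
    # Extended resources such as GPUs are quota-limited via "requests.<name>".
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "team-quota", "namespace": namespace},
        "spec": {
            "hard": {
                "requests.nvidia.com/gpu": str(gpu_limit),
                "requests.cpu": str(cpu_limit),
                "requests.memory": memory_limit,
            }
        },
    }
    return [ns, default_deny, quota]


def gpu_smoke_test_pod(namespace, image):
    """Build a pod that requests one GPU and runs nvidia-smi once."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "gpu-smoke-test", "namespace": namespace},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "nvidia-smi",
                "image": image,
                "command": ["nvidia-smi"],
                "resources": {"limits": {"nvidia.com/gpu": "1"}},
            }],
        },
    }


if __name__ == "__main__":
    # Placeholder values: replace the namespace, limits, and image with your own.
    items = shared_pool_manifests(
        "team-a-dev", gpu_limit=4, cpu_limit=64, memory_limit="512Gi"
    )
    items.append(
        gpu_smoke_test_pod("team-a-dev", "registry.example.com/cuda-base:latest")
    )
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```

The output is a `v1` `List` in JSON, which `kubectl apply -f -` accepts on standard input. If the smoke-test pod completes and its log lists a GPU, the device is reachable from that tier; a pod rejected for exceeding the quota is evidence that enforcement is active.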

Day 2: Operate

| Operation | Read next |
| --- | --- |
| Monitor platform and team workloads | Fleet monitoring, monitoring overview |
| Manage private node capacity | Auto Nodes, manage private nodes |
| Upgrade clusters and Platform | Upgrade vCluster, upgrade Platform |
| Back up and restore | Snapshots, Platform backup |