Architecture

How CosmicAC's components connect to your Kubernetes cluster and run each job type.

CosmicAC is a self-hosted platform that runs GPU workloads on your Kubernetes cluster. This page explains the components involved, how they connect to your cluster, and how each job type runs. For deployment steps, see Installation.

Deployment architecture

Setting up your cluster is separate from deploying CosmicAC. You bring a Kubernetes cluster that already has its GPU nodes and KubeVirt configured. The CosmicAC components then connect to that cluster and run your workloads on it.

wrk-server-k8s-nvidia connects to your cluster's Kubernetes API and creates the resources each job needs. For each job request it builds a graph of Kubernetes resources and applies them through the API server. For every workload it provisions a GPU VMI, a KubeVirt Virtual Machine Instance. CDI imports the VMI's root disk from a registry image, and the VMI claims one or more whole GPUs through PCI passthrough. Multi-node workloads also claim InfiniBand.

The cluster the worker drives has two requirement tiers.

Base platform — the Kubernetes, GPU, virtualization, storage, and registry foundation that every workload needs.
Overlay networking — an optional add-on tier, needed only when a workload requires a per-instance isolated network, an OVS/VXLAN subcluster reachable through a WireGuard gateway. This capability is independent of the workload type.

CosmicAC documents the cluster requirements, not the steps to build the cluster. See Requirements for those requirements.

Racks

A rack is a machine registered into your CosmicAC network and tracked by the orchestrator (wrk-ork). Racks tell the platform what infrastructure is available to schedule onto. Each rack has an ID, a type, and an RPC public key that the orchestrator uses to reach it.

There are two rack types.

server — a GPU server that runs jobs. It declares a location and a GPU type, and its GPUs become the available capacity the platform schedules onto. CosmicAC aggregates available GPUs and pricing per server rack by location.
dataseeder — a node that seeds data across the network.

You register and manage racks with the CLI. See How to manage racks in your CosmicAC network.

CosmicAC components

These components make up CosmicAC. Most run outside your cluster as part of the self-hosted platform, and the per-job agents run inside each job's VMI.

app-ui — web interface that provides a browser dashboard for creating and managing jobs.
cosmicac-cli — command-line interface that submits jobs, manages resources, and connects to containers from your terminal.
app-node — application server that serves the HTTP API, authenticates requests, and routes commands to the orchestrator.
wrk-ork — orchestrator that allocates resources, distributes jobs across the cluster, and routes requests to the workers.
wrk-server-k8s-nvidia — Kubernetes server worker that connects to your cluster's Kubernetes API and provisions the GPU VMs.
proxy-inference — inference proxy that authenticates Managed Inference requests, balances load, and routes them to model servers.
wrk-agent-instance — GPU Container agent that runs inside a GPU Container Job's VMI and accepts shell sessions over hyperswarm-ssh.
wrk-agent-inference — Managed Inference agent that runs inside a Managed Inference Job's VMI, serves the model with vLLM, and registers itself in the DHT table.
redis — in-memory data store that app-node uses for caching and runtime state, with persistence enabled.
caddy — web entry point that serves the UI and reverse-proxies API and inference traffic on port 5173.

Caddy is the external entry point on port 5173. It serves the UI and proxies /api to app-node and /inference to proxy-inference.

Holepunch stack

Inside CosmicAC, the components connect to each other over the Holepunch peer-to-peer (p2p) stack rather than through a central server. Components address each other directly, so there's no central broker to route, bottleneck, or expose internal traffic:

Hyperswarm — peer-to-peer networking that lets components find and connect to each other directly, without a central broker.
HRPC — Hyperswarm RPC that carries internal calls between app-node, wrk-ork, and the workers.
hyperswarm-ssh — SSH over Hyperswarm that lets cosmicac-cli shell directly into a running GPU Container Job.
DHT table — distributed hash table where Managed Inference model servers register, and proxy-inference discovers them by topic.
HyperDB + Autobase — distributed database that stores usage metrics and job metadata.

GPU Container architecture

A GPU Container Job runs your workload inside a KubeVirt VMI with a GPU and shell access.

How a job starts. When you submit a job from app-ui or cosmicac-cli, it travels through the CosmicAC components to your cluster.

app-node authenticates the request and forwards it to wrk-ork.
wrk-ork routes the job to wrk-server-k8s-nvidia.
wrk-server-k8s-nvidia instructs the Kubernetes control plane to schedule the workload.
Kubernetes creates a pod containing a VMI, with wrk-agent-instance running inside it.

How a shell connects. Once the VMI is running, cosmicac-cli connects directly to wrk-agent-instance over hyperswarm-ssh. Your commands reach the VMI over the Holepunch p2p stack rather than through app-node, so the interactive session doesn't depend on the control path that submitted the job.

Managed Inference architecture

A Managed Inference Job runs an open-source language model with vLLM inside a VMI. It exposes the model through proxy-inference as an OpenAI-compatible endpoint, which authenticates requests and balances load. You reach the model through that endpoint from any OpenAI-compatible client, or by running inference directly with cosmicac-cli.

How a job starts. When you create a Managed Inference Job from app-ui, the request flows through app-node and wrk-ork to wrk-server-k8s-nvidia. The worker schedules a pod with a VMI running wrk-agent-inference (vLLM). On spin-up, wrk-agent-inference registers itself in the DHT table so the proxy can find it.

How a request is served. Serving traffic follows a separate path from job creation:

A client sends a request to the inference endpoint over the OpenAI-compatible API, or you run inference from cosmicac-cli.
proxy-inference authenticates the request, searches the DHT table by topic to discover a model server, and balances load across the running servers.
wrk-agent-inference runs the request with vLLM and returns the response.

Job lifecycle

A job moves through these states from when you create it until you delete it.

Stopping a job pauses it, and you can start it again later. Deleting a job removes it and its allocated resources.

Isolation and security

VM-level isolation — each job runs in its own KubeVirt VMI inside a non-privileged pod, with Kubernetes security controls applied.
Secure GPU access — CosmicAC exposes GPUs to the VMIs without privileged containers.

Deployment architecture

Racks

CosmicAC components

Holepunch stack

GPU Container architecture

Managed Inference architecture

Job lifecycle

Isolation and security

Next steps

GPU Container Job

Managed Inference

Installation

On this page