Managed Inference Job

A Managed Inference Job runs an open-source language model inside a KubeVirt virtual machine instance (VMI) on your cluster. It serves the model behind an OpenAI-compatible API. You send requests the same way you would to any OpenAI-compatible endpoint.

When to use one

A Managed Inference Job fits when you want to call a language model over an API and let CosmicAC run it for you. You pick an open-source model, and CosmicAC serves it.

If you instead want direct control of a GPU to run your code, a GPU Container Job is the better fit. It gives you a machine and a shell, and you set up the environment yourself.

What you get

An OpenAI-compatible API — your existing clients and SDKs work without changes.
Open-source models — served on your cluster with vLLM.

Models

A Managed Inference Job serves one model. You can serve any model that vLLM supports, including open-source models from the Hugging Face Hub. You identify the model by its Hugging Face model ID, for example Qwen/Qwen3-32B.

You can find a model to serve in the Hugging Face model hub or the vLLM supported models list.

How it works

A Managed Inference Job moves through a simple lifecycle. When you create it from the CLI or the web UI, CosmicAC schedules it on a GPU node in your cluster. CosmicAC then provisions a VMI that serves the model with vLLM. Once the model is serving, you call it through the OpenAI-compatible endpoint, which authenticates your requests and routes them to the running model. Stopping the job pauses it, so you can start it again later. Deleting it releases its resources.

For the component-level path a request takes through CosmicAC, see Architecture.

How you connect

You call the model in two ways. You can send requests from any OpenAI-compatible client, or run inference directly with cosmicac-cli. Both authenticate with an API key.

For the steps to connect a client, see How to connect to a Managed Inference endpoint.