Two-Lane Text GPU Allocation: Quality + Vision/Fast (Plus a Media Lane)

Tags: lab, gpu, kubernetes, mlc-llm, rocm, inference, scheduling, homelab, flexinfer

Five models crammed onto one GPU, and only one could run at a time. That was the state of my FlexInfer cluster before this week. The other 7900 XTX node had 12GB of free VRAM sitting idle. I spent an afternoon fixing the layout, and the result is a setup that's faster, more available, and actually uses both text-generation GPUs.

This post covers the before/after, the design decisions, the benchmarks, and the operational details that made it work.

TL;DR

  • Split 5 models from one 7900 XTX node into two dedicated text lanes: quality (14B always-on + 32B/R1 on-demand) and vision+fast (vision always-on + 8B when vision idles).
  • On the current ai.flexinfer/v1alpha2 Model CRD, spec.gpu.shared enforces mutual exclusion per GPU and spec.gpu.priority decides which model wins. Higher priority = stays loaded.
  • serviceLabels drive proxy routing. OpenAI-compatible aliases (gpt-4, copilot, gpt-4o) resolve through labels (and can overlap for failover).
  • Measured ~75 TPS on qwen3-14b-mlc (7900 XTX, MLC-LLM): 74.8 at 100 tokens and 74.9 sustained over 300 tokens. qwen3-vl-vision (llama.cpp) reached 77.7 TPS on text workloads. All 12 aliases route correctly, with short, warm requests completing in roughly 0.06–0.30s.
  • Removed the 0.5B vLLM test model and a dead NFS reference. Net result: fewer models, better distribution, more throughput.

The problem: 5 models, 1 GPU

Before this change, cblevins-7900xtx (24GB RX 7900 XTX) was running five text models in a single shared group. Only one could be active at a time:

Node                      Models                                                                          VRAM used
cblevins-7900xtx (24GB)   qwen3-14b-mlc, qwen3-8b-fast, qwen3-32b-quality, deepseek-r1, qwen25-05b-vllm   ~16GB (1 active)
cblevins-5930k (24GB)     qwen3-vl-vision                                                                 ~12GB
cblevins-gtx980ti (6GB)   sdxl-turbo-imagegen                                                             ~5GB

The 5930k node had 12GB of free VRAM doing nothing. Meanwhile, on the 7900xtx, requesting a non-primary model meant a cold swap: drain the active model, load the new one, serve the request. That's 30–120 seconds of latency depending on model size.

The design: two lanes

The fix was straightforward: split the text models across both 7900 XTX nodes by workload type.

Quality lane (cblevins-7900xtx): the 14B model stays always-on as the universal fallback. The 32B and R1 models activate on-demand for premium quality or reasoning, preempting 14B temporarily.

Vision + fast lane (cblevins-5930k): the vision model stays always-on. When vision idles out (10-minute timeout), the 8B fast model activates automatically, giving you a second text endpoint for copilot and chat traffic.

The final layout:

cblevins-7900xtx (24GB)        cblevins-5930k (24GB)         cblevins-gtx980ti (6GB)
"Quality Lane"                  "Vision + Fast Lane"           "Media Lane"
────────────────────────        ────────────────────────       ─────────────────────
Shared: 7900xtx-quality         Shared: 5930k-models           sdxl-turbo [ON]
├─ qwen3-14b-mlc  [P:100]      ├─ qwen3-vl-vision [P:100]
├─ qwen3-32b-qual [P:80]       └─ qwen3-8b-fast   [P:90]
└─ deepseek-r1    [P:70]

In the repo, those are current apiVersion: ai.flexinfer/v1alpha2, kind: Model manifests, not legacy ModelDeployment resources. The concrete examples I used while making this change were deploy/models/qwen3-14b-mlc.yaml, deploy/models/fast-chat.yaml, and deploy/models/reasoning.yaml.

How shared groups work

On the recommended v1alpha2 API, Model.spec.gpu.shared creates the mutual exclusion group. Models with the same shared value compete for the same GPU, and Model.spec.gpu.priority determines which model stays loaded when multiple want to run.

apiVersion: ai.flexinfer/v1alpha2
kind: Model
metadata:
  name: qwen3-14b-mlc
spec:
  gpu:
    shared: 7900xtx-quality
    priority: 100 # highest in group = stays loaded

In the quality lane, 14b-mlc at P:100 is the default resident and outranks 32b-quality (P:80) and deepseek-r1 (P:70). When someone explicitly requests the 32B model, FlexInfer preempts 14b-mlc, loads 32B, and serves the request. After the idle timeout, 14b-mlc reclaims the GPU.
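To make the "highest priority in the group stays loaded" rule concrete, here is a minimal Python sketch. It is not FlexInfer's controller code: the dictionary shape and the function name `default_residents` are assumptions for illustration, and it models only the default resident per shared group, not explicit-request preemption or idle timeouts.

```python
from collections import defaultdict


def default_residents(models):
    """Return {shared group: name of its highest-priority model}.

    Hypothetical helper: models in different shared groups never
    contend, so each group is decided independently.
    """
    groups = defaultdict(list)
    for model in models:
        groups[model["shared"]].append(model)
    return {
        group: max(members, key=lambda m: m["priority"])["name"]
        for group, members in groups.items()
    }


# Layout taken from the diagram in this post.
LAYOUT = [
    {"name": "qwen3-14b-mlc", "shared": "7900xtx-quality", "priority": 100},
    {"name": "qwen3-32b-quality", "shared": "7900xtx-quality", "priority": 80},
    {"name": "deepseek-r1", "shared": "7900xtx-quality", "priority": 70},
    {"name": "qwen3-vl-vision", "shared": "5930k-models", "priority": 100},
    {"name": "qwen3-8b-fast", "shared": "5930k-models", "priority": 90},
]

if __name__ == "__main__":
    for group, name in default_residents(LAYOUT).items():
        print(f"{group}: {name}")
```

With the layout above, each lane resolves to its P:100 model, which matches the always-on pair (14b-mlc and vision).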

The spec.serverless.minReplicas field controls the "on" vs "available" distinction on the current Model CRD:

  • minReplicas: 1 = always-on (vision, 14b-mlc)
  • minReplicas: 0 = scale-to-zero when preempted (8b-fast activates only when vision idles)

Alias routing: serviceLabels, not LiteLLM

One thing I learned during benchmarking: FlexInfer's proxy routes by serviceLabels, not by the litellm.aliases field. On v1alpha2, spec.litellm is still useful, but it exists so the controller can project LiteLLM discovery annotations. The proxy resolves model names by matching against the spec.serviceLabels array on each Model.

This means if you want gpt-4 to resolve to your 14B model, you need:

spec:
  serviceLabels:
    - fast-chat
    - quality-chat
    - gpt-4 # OpenAI-compatible alias
    - gpt-3.5-turbo
    - copilot

There's a CRD-enforced limit of 10 service labels per model on the current v1alpha2 schema. I hit this when I tried to add all the aliases to 14b-mlc (which had 8 existing labels plus 6 new ones). The fix was dropping the redundant -text variants since -chat covers the same routing purpose.
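A quick pre-flight check would have caught this before the API server rejected the manifest. The sketch below is an assumption-laden illustration: the function name and the placeholder label names are hypothetical, and only the limit of 10 comes from the schema described above.

```python
MAX_SERVICE_LABELS = 10  # limit described for the v1alpha2 schema in this post


def label_budget(existing, new, limit=MAX_SERVICE_LABELS):
    """Return (total unique labels, how many over the limit).

    Hypothetical helper: duplicates are collapsed while keeping order,
    since the same label listed twice should not count twice.
    """
    merged = list(dict.fromkeys(list(existing) + list(new)))
    return len(merged), max(0, len(merged) - limit)


if __name__ == "__main__":
    # Placeholder names standing in for the 8 existing and 6 new labels.
    existing = [f"existing-{i}" for i in range(8)]
    new = [f"alias-{i}" for i in range(6)]
    total, over = label_budget(existing, new)
    print(f"{total} labels, {over} over the limit")
```

For the 8 + 6 case from this post, the check reports 14 labels, 4 over the limit, which is why the -text variants had to go.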

Shared labels and priority routing

When two models share a service label (e.g., both 14b-mlc and 8b-fast have fast-chat), the proxy routes to whichever model is Ready. If both are Ready (during coding sessions with no vision traffic), the proxy load-balances between them. If one is Idle/preempted, traffic routes to the other automatically.
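The routing behavior described above can be sketched as follows. This is an illustration, not the proxy's implementation: the post does not say which load-balancing algorithm is used, so round-robin is an assumption, as are the function names and the dictionary shape.

```python
def ready_backends(label, models):
    """Names of models that carry the label and are currently Ready."""
    return [
        m["name"]
        for m in models
        if label in m["labels"] and m["phase"] == "Ready"
    ]


def route(label, models, request_index):
    """Pick a backend for a request, or None if nothing is Ready.

    Round-robin over Ready backends is an assumed policy.
    """
    backends = ready_backends(label, models)
    if not backends:
        return None
    return backends[request_index % len(backends)]


if __name__ == "__main__":
    models = [
        {"name": "qwen3-14b-mlc", "labels": ["fast-chat"], "phase": "Ready"},
        {"name": "qwen3-8b-fast", "labels": ["fast-chat"], "phase": "Ready"},
    ]
    for i in range(4):
        print(i, route("fast-chat", models, i))
```

If qwen3-8b-fast drops to Idle, `ready_backends` returns only the 14B model and every request lands there, which is the automatic failover the paragraph describes.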

One gotcha: if the same label exists on a Ready model and an Idle model, the proxy may try to activate the Idle one and timeout. I hit this with reasoning: it was on both deepseek-r1 (Idle, P:70) and 14b-mlc (Ready, P:100). The proxy picked R1, tried to activate it, but R1 couldn't preempt the higher-priority 14b-mlc. The fix: remove overlapping labels from lower-priority models that can't self-activate. Use the direct model name (deepseek-r1-reasoning) when you specifically want R1.
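This class of mistake is easy to lint for before applying manifests. The sketch below is a hypothetical check, not part of FlexInfer: it flags any label that an always-on model (minReplicas of 1 or more) shares with a lower-priority model in the same shared group, which is the situation that caused the timeout above. Field names mirror the CRD fields loosely and are assumptions.

```python
def risky_overlaps(models):
    """Return (label, lower-priority model, always-on model) triples.

    A label is flagged when the lower-priority model sits in the same
    shared group as an always-on model and so cannot preempt it.
    """
    findings = []
    for hi in models:
        if hi["min_replicas"] < 1:
            continue  # only always-on models block their group
        for lo in models:
            if lo is hi or lo["shared"] != hi["shared"]:
                continue
            if lo["priority"] >= hi["priority"]:
                continue
            for label in sorted(set(hi["labels"]) & set(lo["labels"])):
                findings.append((label, lo["name"], hi["name"]))
    return findings


if __name__ == "__main__":
    models = [
        {"name": "qwen3-14b-mlc", "shared": "7900xtx-quality", "priority": 100,
         "min_replicas": 1, "labels": ["fast-chat", "reasoning"]},
        {"name": "deepseek-r1", "shared": "7900xtx-quality", "priority": 70,
         "min_replicas": 0, "labels": ["reasoning"]},
        {"name": "qwen3-8b-fast", "shared": "5930k-models", "priority": 90,
         "min_replicas": 0, "labels": ["fast-chat"]},
    ]
    for finding in risky_overlaps(models):
        print(finding)
```

The fast-chat overlap between 14b-mlc and 8b-fast is not flagged because the two models live in different shared groups and can both be Ready at once.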

Benchmarks

All measurements were taken from inside the cluster (pod-to-service, no ingress overhead), with three runs per test, averaged. Workload: single-request, non-streaming, timed with curl -w '%{time_starttransfer}' as a convenient stand-in for wall time. (With non-streaming responses, time to first byte is close to time-to-completion and good enough for relative comparisons.)

Quality lane (qwen3-14b-mlc on 7900 XTX, MLC-LLM)

Alias                    Max tokens   Avg wall   Avg TPS
qwen3-14b-mlc (direct)   20           0.372s     55.1
fast-chat                100          1.337s     74.8
quality-chat             80           1.103s     72.6
gpt-4                    10           0.258s     39.1
gpt-3.5-turbo            30           0.448s     67.1
copilot                  50           0.778s     64.8
textgen                  100          1.354s     73.8
reasoning                60           0.854s     70.5
o1-preview               80           1.080s     74.1

Sustained throughput:

Length       Avg wall   Avg TPS
300 tokens   4.0s       74.9
500 tokens   12.1s      41.4

The TPS drop at 500 tokens is expected: the KV cache grows with sequence length, and per-token compute/memory traffic increases as context grows. For typical copilot bursts (10–100 tokens), you get the full ~75 TPS.
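For anyone reproducing these numbers, the TPS column is just tokens divided by wall time. The helper names in this sketch are my own, and the small gaps against the tables (75.0 vs 74.9, 41.3 vs 41.4) are presumably down to rounded wall times and averaging TPS per run rather than dividing by the averaged wall time.

```python
from statistics import mean


def tokens_per_second(tokens, wall_seconds):
    """Throughput for a single request."""
    return tokens / wall_seconds


def average_tps(runs):
    """Mean of per-run TPS for (tokens, wall_seconds) pairs.

    Averaging per-run TPS is not the same as dividing tokens by the
    mean wall time, so the two can differ slightly.
    """
    return mean(tokens_per_second(t, w) for t, w in runs)


if __name__ == "__main__":
    # Sustained-throughput rows from the table above.
    for tokens, wall in [(300, 4.0), (500, 12.1)]:
        print(f"{tokens} tokens in {wall}s -> {tokens_per_second(tokens, wall):.1f} TPS")
```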

Vision lane (qwen3-vl-vision on 5930k, llama.cpp)

Alias    Max tokens   Avg wall   Avg TPS
vision   60           0.773s     77.7
gpt-4o   2            0.071s     28.3
ocr      20           0.290s     69.2

The vision model (8B, Q4_K_M, llama.cpp) slightly outperforms the 14B text model on TPS, likely because it's smaller and easier to keep saturated.

Routing burst (all 12 aliases, 5 tokens each)

Alias           Latency   Routed to
fast-chat       0.196s    qwen3-14b-mlc
gpt-4           0.139s    qwen3-14b-mlc
copilot         0.153s    qwen3-14b-mlc
quality-chat    0.185s    qwen3-14b-mlc
gpt-3.5-turbo   0.189s    qwen3-14b-mlc
textgen         0.301s    qwen3-14b-mlc
reasoning       0.165s    qwen3-14b-mlc
o1-preview      0.181s    qwen3-14b-mlc
vision          0.124s    qwen3-vl-vision
gpt-4o          0.094s    qwen3-vl-vision
ocr             0.104s    qwen3-vl-vision
dall-e-3        0.061s    sdxl-turbo-imagegen

12/12 aliases route correctly, each in about 0.3s or less for short requests.

Implementation details

Storage: NVMe hostPath PVs

The 8B model needed to be compiled on the 5930k node (same gfx1100 arch, but a local copy). I added a 20Gi hostPath PV/PVC on the 5930k's NVMe:

spec:
  hostPath:
    path: /var/lib/flexinfer/qwen3-8b-abliterated-mlc-nvme
    type: DirectoryOrCreate
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ['cblevins-5930k']

Compile job

MLC-LLM requires a GPU-local compilation step to produce the .so library for ROCm gfx1100. I created a one-time Kubernetes Job targeting the 5930k node. The compile job downloads from HuggingFace, quantizes to q4f32_1, generates config, and compiles the model library, all in one shot.

One operational detail: the compile job needs amd.com/gpu: 1, which means it competes with the vision model for the GPU. I had to scale the vision deployment to zero, let the compile job run, then scale vision back up. FlexInfer's shared group preemption operates at the application layer, not the K8s scheduler layer, so the scheduler can't preempt a running pod for a Job.

Cache-check timing

After compilation, FlexInfer runs a cache-check job to verify the model files exist on the PVC before transitioning the model to Idle. If the cache-check runs before compilation finishes (which it did in my case), it fails and the model stays in Pending. Deleting the failed cache-check job triggers a re-check, and the model transitions to Idle once it finds the compiled artifacts.
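The cache-check logic amounts to "are the expected files on the PVC yet?". Here is a rough sketch of that decision, assuming a caller-supplied list of required file names. The function names are hypothetical, and the real job's file list and phase handling are not shown in this post.

```python
from pathlib import Path


def missing_artifacts(cache_dir, required):
    """Return the required file names not yet present under cache_dir."""
    root = Path(cache_dir)
    return [name for name in required if not (root / name).is_file()]


def next_phase(cache_dir, required):
    """Idle once every artifact exists, otherwise stay Pending."""
    return "Pending" if missing_artifacts(cache_dir, required) else "Idle"


if __name__ == "__main__":
    # Placeholder file names: the real artifact names depend on the compile job.
    required = ["model-lib.so", "model-config.json"]
    path = "/var/lib/flexinfer/qwen3-8b-abliterated-mlc-nvme"
    print(next_phase(path, required), missing_artifacts(path, required))
```

Running a check like this at the end of the compile job, before the controller's cache-check fires, would avoid the failed-job-then-delete dance.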

GPUGroup CRDs

I also updated the legacy GPUGroup CRDs to reflect the new layout (two groups instead of one). Those resources are part of the older ai.flexinfer/v1alpha1 API alongside ModelDeployment; they are not the primary way current models are defined. The active model API is ai.flexinfer/v1alpha2, kind: Model, and the lane behavior in this post comes from spec.gpu.shared, spec.gpu.priority, and spec.serverless. I keep GPUGroup in sync mostly as operational bookkeeping and compatibility context, not because new model definitions depend on it.

What I cleaned up

  • Removed qwen25-05b-vllm: The 0.5B vLLM test model was superseded by the 8B fast model. Removed from kustomization.
  • Deleted qwen3-14b-abliterated.yaml: Dead file referencing a down NFS server. Not in kustomization but cluttering the directory.

What I'd do differently

The serviceLabels limit (10 per model) caught me off guard. If I were designing the label strategy from scratch, I'd use fewer, more semantic labels (text, vision, code, image) rather than mirroring every OpenAI model name. The OpenAI aliases are convenient for drop-in compatibility, but they eat into a limited budget.

The compile-before-deploy ordering is also something I'd automate. Right now it's manual: scale down vision, run compile job, scale up vision. A pre-deploy hook or an init container that checks for compiled artifacts would make this smoother.

Takeaways

  1. Distribute models by usage pattern, not just by size. Putting "always-on" and "on-demand" models on the same GPU is fine with shared groups, but splitting across nodes gives you parallelism you can't get from time-sharing.
  2. On v1alpha2 Model, serviceLabels are the routing primitive. If the proxy can't find your model, check labels first. spec.litellm is for LiteLLM discovery, not the main proxy label match.
  3. Shared labels on models with different priorities can cause timeouts. Keep aliases on the highest-priority (always-Ready) model. Use direct model names for on-demand models.
  4. MLC-LLM on gfx1100 sustains ~75 TPS for 14B models at moderate context lengths. That's fast enough for copilot, chat, and code generation workloads on a homelab.
  5. Both 7900 XTX nodes perform similarly despite one being on a 2014 Haswell-E platform (i7-5930K) and the other on a 2023 Zen 4 (Ryzen 9 7900X3D). For these workloads, inference is GPU-bound enough that CPU generation didn't move the needle.


Update (2026-04-18): two lanes became four, and the preemption semantics matured

The two-lane framing (quality on 7900xtx, vision+fast on 5930k, plus a media lane on 980ti) is still correct. Two months on, the serving plane grew a fourth lane and the preemption behavior picked up enough scar tissue to be worth documenting.

The fleet now

Node                 Arch      VRAM         Role                              Typical workloads
cblevins-7900xtx     gfx1100   24 GB        Primary text-gen + 32K canary     Gemma 4 26B-A4B (primary), Gemma 4 26B-A4B 32K canary (on-demand swap)
cblevins-5930k       gfx1100   24 GB        Text-gen + image-gen co-tenants   OmniCoder 9B GPTQ, Qwen3.5 9B GPTQ (staged), FluxPony image generation
cblevins-radeonvii   gfx906    16 GB HBM2   Long-context + quant work         Gemma 4 31B GPTQ, Qwen3.5 27B opus distill, SDXL inpainting
cblevins-gtx980ti    sm_52     6 GB         Embeddings + legacy               nomic-embed-text (Ollama)

Two AMD GPU architectures now serve inference (gfx1100 on two nodes, gfx906 on one), not one. MLC is a cross-cutting lane rather than a single-node story: the qwen3-{0.6, 4, 8, 14, 32}B-mlc family serves serverless off a shared NFS cache regardless of which physical node picks them up.

What the shared groups look like today

The single quality-textgen shared group became:

  • 7900xtx-textgen (cblevins-7900xtx): two Gemma 4 26B-A4B Model CRs at priority 200 (gemma4-26b-a4b-gptq primary, gemma4-26b-a4b-gptq-long 32K canary). Priorities are equal; the canary can swap in on demand when the primary goes idle, not preempt mid-serve.
  • 5930k-imagegen-textgen: image generation (FluxPony, SDXL) at priority 200, text-gen (OmniCoder, staged Qwen3.5) at priority 150. Image has preemption rights; text yields when a render request lands.
  • radeonvii-models: Gemma 4 31B and SDXL inpainting at equal priority. Workload mix means these rarely contend.

The point is not the specific priorities. The point is that "two lanes" was a static allocation; "shared groups across four nodes" is a policy, and the policy has to specify what happens when two co-tenants want the same GPU at the same time.

Preemption semantics that matured

The original post described serviceLabels and priority. The enforcement got more careful:

  • Loading-phase guard. The controller holds replicas at 1 whenever Model.status.phase == Loading, regardless of LastActiveTime staleness. Without this, a serverless model that took 10 min to load would get reaped mid-load because the proxy only writes LastActiveTime once at request arrival. This is the bug that made serverless oscillate between 0 and 1 replicas during a single cold start.
  • Equal-priority co-tenants do not preempt each other. They swap when the active one is idle, with a cooldown to prevent thrashing. The primary/canary pair above relies on this.
  • Preempt-to-ready latency is an SLI. See the update to slos-for-inference; flexinfer_model_swap_duration_seconds is the histogram and p95 > a few seconds means the "shared group" illusion is breaking.
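The loading-phase guard from the first bullet is easiest to see as a replica decision function. This is a simplified sketch under my own naming, not the controller's code: it ignores priorities and cooldowns and only shows why a stale LastActiveTime must not reap a model that is still loading.

```python
def desired_replicas(phase, seconds_since_last_active, idle_timeout_s, min_replicas=0):
    """Replica count for a serverless model (hypothetical helper).

    The Loading check comes first so a slow cold start is never reaped,
    however stale the last-activity timestamp has become.
    """
    if phase == "Loading":
        return max(1, min_replicas)
    if seconds_since_last_active < idle_timeout_s:
        return max(1, min_replicas)
    return min_replicas


if __name__ == "__main__":
    # A model 15 minutes into loading, with a 10-minute idle timeout.
    print(desired_replicas("Loading", 900, 600))  # guard keeps it at 1
    print(desired_replicas("Ready", 900, 600))    # idle and Ready: scale down
```

Without the first branch, the Loading case would fall through to the idle check and return 0, which is the 0-to-1 oscillation described above.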

What broke in the meantime

Three failure modes worth calling out:

  • Longhorn cross-node replica reads stalling mmap loads. A 3-replica Longhorn cache PVC, a swap into a cold cache, and vLLM's mmap loader added up to an 8m47s stall on one safetensors shard while FlexDeck showed "Loading" the whole time. The cold-start-stall-loadingsubstage post walks through the incident. The fix is single-replica local storage for serving-path caches plus the new Model.status.loadingSubstage fields.
  • max_tokens = max_model_len means 0 prompt budget. Clients that default max_tokens to the advertised context window produce a 100% 400 rate at vLLM because the implied prompt budget is 0. The proxy now clamps max_tokens to context_window - 512 at the forward seam.
  • Proxy fail-fast on stalled loads. When a cold-start wedges, the proxy was queuing fresh requests indefinitely. It now returns 503 + Retry-After after 120 s of no LoadingProgressAt advancement on a LoadingWeights substage.
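The last two fixes are small enough to sketch. The code below illustrates the clamp and the stall check as described, with hypothetical function names. The 512-token reserve and the 120 s limit come from this post, while the Retry-After value of 30 is an assumption, since the post does not state one.

```python
PROMPT_RESERVE = 512   # tokens kept free for the prompt (from the post)
STALL_LIMIT_S = 120    # seconds without load progress (from the post)


def clamp_max_tokens(max_tokens, context_window, reserve=PROMPT_RESERVE):
    """Cap max_tokens so at least `reserve` tokens remain for the prompt."""
    return min(max_tokens, context_window - reserve)


def stall_response(substage, seconds_since_progress, limit=STALL_LIMIT_S):
    """Return (status, headers) when a load has stalled, else None.

    Retry-After of 30 seconds is an assumed value for illustration.
    """
    if substage == "LoadingWeights" and seconds_since_progress >= limit:
        return 503, {"Retry-After": "30"}
    return None


if __name__ == "__main__":
    # A client that sets max_tokens to the full 32K context window.
    print(clamp_max_tokens(32768, 32768))
    print(stall_response("LoadingWeights", 125))
```

A request that asks for the full window is trimmed to leave 512 tokens of prompt budget, and requests within the budget pass through unchanged.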

What I would change about the original two-lane framing

If I were redrawing the diagram today:

  • Name the groups, not the nodes. "Quality lane on 7900xtx" ties policy to hardware; "7900xtx-textgen group" separates the policy (priority, preemption) from the fleet (which node the group happens to live on). That makes node swaps cheaper when hardware changes.
  • Include the MLC lane as a cross-cutting shared-cache surface, not a per-node story.
  • Add the observability boxes: FlexDeck rendering phase + substage + message, proxy clamp + stall counters, per-group preemption and swap-latency panels.

The two-lane pattern is still the right starting shape. It just grows into "n lanes + policy" faster than you might expect once more than two models are serverless on the same cluster.
