Two-Lane Text GPU Allocation: Quality + Vision/Fast (Plus a Media Lane)
Five models crammed onto one GPU, and only one could run at a time. That was the state of my FlexInfer cluster before this week. The other 7900 XTX node had 12GB of free VRAM sitting idle. I spent an afternoon fixing the layout, and the result is a setup that's faster, more available, and actually uses both text-generation GPUs.
This post covers the before/after, the design decisions, the benchmarks, and the operational details that made it work.
TL;DR
- Split 5 models from one 7900 XTX node into two dedicated text lanes: quality (14B always-on + 32B/R1 on-demand) and vision+fast (vision always-on + 8B when vision idles).
- On the current ai.flexinfer/v1alpha2 Model CRD, spec.gpu.shared enforces mutual exclusion per GPU and spec.gpu.priority decides which model wins. Higher priority = stays loaded.
- serviceLabels drive proxy routing. OpenAI-compatible aliases (gpt-4, copilot, gpt-4o) resolve through labels (and can overlap for failover).
- Measured 74.8 TPS sustained on qwen3-14b-mlc (7900 XTX, MLC-LLM) and 77.7 TPS on qwen3-vl-vision (llama.cpp) for text workloads. All 12 aliases route correctly with ~300ms latency for short, warm requests.
- Removed the 0.5B vLLM test model and a dead NFS reference. Net result: fewer models, better distribution, more throughput.
The problem: 5 models, 1 GPU
Before this change, cblevins-7900xtx (24GB RX 7900 XTX) was running five text models in a single shared group. Only one could be active at a time:
| Node | Models | VRAM Used |
|---|---|---|
| cblevins-7900xtx (24GB) | qwen3-14b-mlc, qwen3-8b-fast, qwen3-32b-quality, deepseek-r1, qwen25-05b-vllm | ~16GB (1 active) |
| cblevins-5930k (24GB) | qwen3-vl-vision | ~12GB |
| cblevins-gtx980ti (6GB) | sdxl-turbo-imagegen | ~5GB |
The 5930k node had 12GB of free VRAM doing nothing. Meanwhile, on the 7900xtx, requesting a non-primary model meant a cold swap: drain the active model, load the new one, serve the request. That's 30–120 seconds of latency depending on model size.
The design: two lanes
The fix was straightforward: split the text models across both 7900 XTX nodes by workload type.
Quality lane (cblevins-7900xtx): the 14B model stays always-on as the universal fallback. The 32B and R1 models activate on-demand for premium quality or reasoning, preempting 14B temporarily.
Vision + fast lane (cblevins-5930k): the vision model stays always-on. When vision idles out (10-minute timeout), the 8B fast model activates automatically, giving you a second text endpoint for copilot and chat traffic.
The final layout:
```text
cblevins-7900xtx (24GB)      cblevins-5930k (24GB)        cblevins-gtx980ti (6GB)
"Quality Lane"               "Vision + Fast Lane"         "Media Lane"
────────────────────────     ────────────────────────     ─────────────────────
Shared: 7900xtx-quality      Shared: 5930k-models         sdxl-turbo [ON]
├─ qwen3-14b-mlc  [P:100]    ├─ qwen3-vl-vision [P:100]
├─ qwen3-32b-qual [P:80]     └─ qwen3-8b-fast   [P:90]
└─ deepseek-r1    [P:70]
```
In the repo, those are current apiVersion: ai.flexinfer/v1alpha2, kind: Model manifests, not legacy ModelDeployment resources. The concrete examples I used while making this change were deploy/models/qwen3-14b-mlc.yaml, deploy/models/fast-chat.yaml, and deploy/models/reasoning.yaml.
How shared groups work
On the recommended v1alpha2 API, Model.spec.gpu.shared creates the mutual exclusion group. Models with the same shared value compete for the same GPU, and Model.spec.gpu.priority determines which model stays loaded when multiple want to run.
```yaml
apiVersion: ai.flexinfer/v1alpha2
kind: Model
metadata:
  name: qwen3-14b-mlc
spec:
  gpu:
    shared: 7900xtx-quality
    priority: 100 # highest in group = stays loaded
```
In the quality lane, 14b-mlc at P:100 always wins over 32b-quality (P:80) and deepseek-r1 (P:70). When someone explicitly requests the 32B model, FlexInfer preempts 14b-mlc, loads 32B, serves the request, and then after the idle timeout, 14b-mlc reclaims the GPU.
The spec.serverless.minReplicas field controls the "on" vs "available" distinction on the current Model CRD:
- minReplicas: 1 = always-on (vision, 14b-mlc)
- minReplicas: 0 = scale-to-zero when preempted (8b-fast activates only when vision idles)
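To make the shared/priority behavior concrete, here is a minimal Python sketch of the rules as described in this post. It is not FlexInfer's controller code: GroupMember, resident, and the requested/idle parameters are hypothetical names for illustration, and the model is deliberately simplified (an explicit request takes the GPU, otherwise the highest-priority member that has not idled out holds it).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupMember:
    name: str
    shared: str    # mirrors spec.gpu.shared
    priority: int  # mirrors spec.gpu.priority


# The layout from the diagram above.
MEMBERS = [
    GroupMember("qwen3-14b-mlc", "7900xtx-quality", 100),
    GroupMember("qwen3-32b-quality", "7900xtx-quality", 80),
    GroupMember("deepseek-r1", "7900xtx-quality", 70),
    GroupMember("qwen3-vl-vision", "5930k-models", 100),
    GroupMember("qwen3-8b-fast", "5930k-models", 90),
]


def resident(members, group, requested=None, idle=frozenset()):
    """Return the name of the model that holds the GPU for a shared group.

    requested: a model explicitly asked for by name (hypothetical parameter).
    idle: names of members that have idled out (hypothetical parameter).
    """
    in_group = [m for m in members if m.shared == group]
    if not in_group:
        raise ValueError(f"no models in shared group {group!r}")
    if requested is not None:
        if requested not in {m.name for m in in_group}:
            raise ValueError(f"{requested!r} is not in group {group!r}")
        return requested  # explicit request takes the GPU for now
    # Otherwise the highest-priority member that has not idled out wins.
    # If every member is idle, fall back to the highest priority overall.
    awake = [m for m in in_group if m.name not in idle] or in_group
    return max(awake, key=lambda m: m.priority).name
```

With this layout, resident(MEMBERS, "7900xtx-quality") picks qwen3-14b-mlc, and passing idle={"qwen3-vl-vision"} for the 5930k group hands the GPU to qwen3-8b-fast.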
Alias routing: serviceLabels, not LiteLLM
One thing I learned during benchmarking: FlexInfer's proxy routes by serviceLabels, not by the litellm.aliases field. On v1alpha2, spec.litellm is still useful, but it exists so the controller can project LiteLLM discovery annotations. The proxy resolves model names by matching against the spec.serviceLabels array on each Model.
This means if you want gpt-4 to resolve to your 14B model, you need:
```yaml
spec:
  serviceLabels:
    - fast-chat
    - quality-chat
    - gpt-4 # OpenAI-compatible alias
    - gpt-3.5-turbo
    - copilot
```
There's a CRD-enforced limit of 10 service labels per model on the current v1alpha2 schema. I hit this when I tried to add all the aliases to 14b-mlc (which had 8 existing labels plus 6 new ones). The fix was dropping the redundant -text variants since -chat covers the same routing purpose.
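A small pre-commit style check can catch the label budget before the API server rejects the manifest. The sketch below is an assumption-laden illustration, not part of FlexInfer: the function names are hypothetical, and the only facts taken from this post are the limit of 10 and the idea of dropping -text variants when the matching -chat label exists.

```python
MAX_SERVICE_LABELS = 10  # limit reported for the v1alpha2 schema in this post


def drop_redundant_text_labels(labels):
    """Drop 'x-text' when 'x-chat' is also present, preserving order."""
    present = set(labels)
    kept = []
    for label in labels:
        if label.endswith("-text"):
            chat_twin = label[: -len("-text")] + "-chat"
            if chat_twin in present:
                continue  # the -chat label already covers this route
        kept.append(label)
    return kept


def check_label_budget(labels, limit=MAX_SERVICE_LABELS):
    """Raise ValueError if the de-duplicated label list exceeds the limit."""
    unique = list(dict.fromkeys(labels))  # de-duplicate, keep order
    if len(unique) > limit:
        raise ValueError(
            f"{len(unique)} service labels exceeds the limit of {limit}"
        )
    return unique
```

Running drop_redundant_text_labels first and check_label_budget second mirrors the order of the fix described above: trim the redundant variants, then confirm the remainder fits.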
Shared labels and priority routing
When two models share a service label (e.g., both 14b-mlc and 8b-fast have fast-chat), the proxy routes to whichever model is Ready. If both are Ready (during coding sessions with no vision traffic), the proxy load-balances between them. If one is Idle/preempted, traffic routes to the other automatically.
One gotcha: if the same label exists on a Ready model and an Idle model, the proxy may try to activate the Idle one and time out. I hit this with reasoning: it was on both deepseek-r1 (Idle, P:70) and 14b-mlc (Ready, P:100). The proxy picked R1, tried to activate it, but R1 couldn't preempt the higher-priority 14b-mlc. The fix: remove overlapping labels from lower-priority models that can't self-activate. Use the direct model name (deepseek-r1-reasoning) when you specifically want R1.
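The overlap gotcha is easy to lint for. The following Python sketch is only an illustration of the rule described here, not the proxy's actual resolution logic: RoutedModel, ready_targets, and risky_overlaps are hypothetical names, and the phases are reduced to "Ready" and "Idle".

```python
from dataclasses import dataclass, field


@dataclass
class RoutedModel:
    name: str
    shared: str      # mirrors spec.gpu.shared
    priority: int    # mirrors spec.gpu.priority
    phase: str       # "Ready" or "Idle" in this sketch
    service_labels: set = field(default_factory=set)


def ready_targets(models, label):
    """Names of Ready models matching a service label or a direct model name."""
    return [
        m.name
        for m in models
        if m.phase == "Ready" and (label == m.name or label in m.service_labels)
    ]


def risky_overlaps(models):
    """Find labels shared with a lower-priority model in the same group.

    Returns (label, lower_priority_model_name) pairs. Labels shared across
    different groups are not flagged, since those models do not compete
    for the same GPU.
    """
    findings = []
    for hi in models:
        for lo in models:
            if hi.shared == lo.shared and hi.priority > lo.priority:
                for label in sorted(hi.service_labels & lo.service_labels):
                    findings.append((label, lo.name))
    return findings
```

Under these assumptions, a reasoning label on both 14b-mlc and deepseek-r1 is flagged, while fast-chat on 14b-mlc and 8b-fast is not, because those two live in different shared groups.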
Benchmarks
All measurements taken from inside the cluster (pod-to-service, no ingress overhead). Three runs per test, averaged. Workload: single-request, non-streaming, measured with curl -w '%{time_starttransfer}' as a convenient wall time. (With non-streaming responses, it's close to time-to-completion and good enough for relative comparisons.)
Quality lane (qwen3-14b-mlc on 7900 XTX, MLC-LLM)
| Alias | Max tokens | Avg wall | Avg TPS |
|---|---|---|---|
| qwen3-14b-mlc (direct) | 20 | 0.372s | 55.1 |
| fast-chat | 100 | 1.337s | 74.8 |
| quality-chat | 80 | 1.103s | 72.6 |
| gpt-4 | 10 | 0.258s | 39.1 |
| gpt-3.5-turbo | 30 | 0.448s | 67.1 |
| copilot | 50 | 0.778s | 64.8 |
| textgen | 100 | 1.354s | 73.8 |
| reasoning | 60 | 0.854s | 70.5 |
| o1-preview | 80 | 1.080s | 74.1 |
Sustained throughput:
| Length | Avg wall | Avg TPS |
|---|---|---|
| 300 tokens | 4.0s | 74.9 |
| 500 tokens | 12.1s | 41.4 |
The TPS drop at 500 tokens is expected: the KV cache grows with sequence length, and per-token compute/memory traffic increases as context grows. For typical copilot bursts (10–100 tokens), you get the full ~75 TPS.
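For anyone reproducing the tables, the arithmetic is simple enough to sketch. This assumes TPS here means completion tokens divided by wall time; the post does not state whether the averages are per-run means or computed from averaged wall times, so small differences from the published figures are expected. The function names are hypothetical.

```python
def tokens_per_second(tokens, wall_seconds):
    """Completion tokens divided by wall-clock seconds for one request."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return tokens / wall_seconds


def mean_tps(tokens, wall_times):
    """Mean of per-run TPS across several runs of the same request size."""
    if not wall_times:
        raise ValueError("need at least one run")
    return sum(tokens_per_second(tokens, w) for w in wall_times) / len(wall_times)


# Recomputing from the rounded wall times in the tables above:
print(round(tokens_per_second(100, 1.337), 1))  # fast-chat row
print(round(tokens_per_second(300, 4.0), 1))    # 300-token sustained row
print(round(tokens_per_second(500, 12.1), 1))   # 500-token sustained row
```

The recomputed values land within a few tenths of the table entries, which is consistent with the tables being built from unrounded timings.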
Vision lane (qwen3-vl-vision on 5930k, llama.cpp)
| Alias | Max tokens | Avg wall | Avg TPS |
|---|---|---|---|
| vision | 60 | 0.773s | 77.7 |
| gpt-4o | 2 | 0.071s | 28.3 |
| ocr | 20 | 0.290s | 69.2 |
The vision model (8B, Q4_K_M, llama.cpp) slightly outperforms the 14B text model on TPS, likely because it's smaller and easier to keep saturated.
Routing burst (all 12 aliases, 5 tokens each)
| Alias | Latency | Routed to |
|---|---|---|
| fast-chat | 0.196s | qwen3-14b-mlc |
| gpt-4 | 0.139s | qwen3-14b-mlc |
| copilot | 0.153s | qwen3-14b-mlc |
| quality-chat | 0.185s | qwen3-14b-mlc |
| gpt-3.5-turbo | 0.189s | qwen3-14b-mlc |
| textgen | 0.301s | qwen3-14b-mlc |
| reasoning | 0.165s | qwen3-14b-mlc |
| o1-preview | 0.181s | qwen3-14b-mlc |
| vision | 0.124s | qwen3-vl-vision |
| gpt-4o | 0.094s | qwen3-vl-vision |
| ocr | 0.104s | qwen3-vl-vision |
| dall-e-3 | 0.061s | sdxl-turbo-imagegen |
12/12 aliases routed correctly, each in roughly 0.3s or less for short requests (0.061s to 0.301s).
Implementation details
Storage: NVMe hostPath PVs
The 8B model needed to be compiled on the 5930k node (same gfx1100 arch, but a local copy). I added a 20Gi hostPath PV/PVC on the 5930k's NVMe:
```yaml
spec:
  hostPath:
    path: /var/lib/flexinfer/qwen3-8b-abliterated-mlc-nvme
    type: DirectoryOrCreate
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ['cblevins-5930k']
```
Compile job
MLC-LLM requires a GPU-local compilation step to produce the .so library for ROCm gfx1100. I created a one-time Kubernetes Job targeting the 5930k node. The compile job downloads from HuggingFace, quantizes to q4f32_1, generates config, and compiles the model library, all in one shot.
One operational detail: the compile job needs amd.com/gpu: 1, which means it competes with the vision model for the GPU. I had to scale the vision deployment to zero, let the compile job run, then scale vision back up. FlexInfer's shared group preemption operates at the application layer, not the K8s scheduler layer, so the scheduler can't preempt a running pod for a Job.
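The manual ordering (scale vision down, run the compile Job, scale vision back up) can be written down as an explicit plan. This is a rough sketch that only assembles standard kubectl commands; the deployment name, Job name, manifest path, and namespace are placeholders I made up, and it assumes the vision model is backed by a Deployment that can be scaled directly, as described above.

```python
import subprocess


def compile_plan(vision_deployment, job_manifest, job_name, namespace, timeout="60m"):
    """Build the ordered kubectl commands for a one-off compile on a busy GPU."""
    ns = ["-n", namespace]
    return [
        # 1. Free the GPU: the scheduler will not preempt a running pod for a Job.
        ["kubectl", "scale", f"deployment/{vision_deployment}", "--replicas=0", *ns],
        # 2. Submit the compile Job.
        ["kubectl", "apply", "-f", job_manifest, *ns],
        # 3. Block until the Job reports completion (or the timeout expires).
        ["kubectl", "wait", "--for=condition=complete", f"job/{job_name}",
         f"--timeout={timeout}", *ns],
        # 4. Bring the vision model back.
        ["kubectl", "scale", f"deployment/{vision_deployment}", "--replicas=1", *ns],
    ]


def run_plan(plan):
    """Run every step; always attempt the final restore step, even on failure."""
    *steps, restore = plan
    try:
        for cmd in steps:
            subprocess.run(cmd, check=True)
    finally:
        subprocess.run(restore, check=True)
```

Keeping the restore step in a finally block means a failed compile does not leave the vision lane scaled to zero.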
Cache-check timing
After compilation, FlexInfer runs a cache-check job to verify the model files exist on the PVC before transitioning the model to Idle. If the cache-check runs before compilation finishes (which it did in my case), it fails and the model stays in Pending. Deleting the failed cache-check job triggers a re-check, and the model transitions to Idle once it finds the compiled artifacts.
GPUGroup CRDs
I also updated the legacy GPUGroup CRDs to reflect the new layout (two groups instead of one). Those resources are part of the older ai.flexinfer/v1alpha1 API alongside ModelDeployment; they are not the primary way current models are defined. The active model API is ai.flexinfer/v1alpha2, kind: Model, and the lane behavior in this post comes from spec.gpu.shared, spec.gpu.priority, and spec.serverless. I keep GPUGroup in sync mostly as operational bookkeeping and compatibility context, not because new model definitions depend on it.
What I cleaned up
- Removed qwen25-05b-vllm: The 0.5B vLLM test model was superseded by the 8B fast model. Removed from kustomization.
- Deleted qwen3-14b-abliterated.yaml: Dead file referencing a down NFS server. Not in kustomization but cluttering the directory.
What I'd do differently
The serviceLabels limit (10 per model) caught me off guard. If I were designing the label strategy from scratch, I'd use fewer, more semantic labels (text, vision, code, image) rather than mirroring every OpenAI model name. The OpenAI aliases are convenient for drop-in compatibility, but they eat into a limited budget.
The compile-before-deploy ordering is also something I'd automate. Right now it's manual: scale down vision, run compile job, scale up vision. A pre-deploy hook or an init container that checks for compiled artifacts would make this smoother.
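As a starting point for that init-container idea, an artifact check could look something like the sketch below. It is an assumption, not FlexInfer's cache-check: the function name is hypothetical, and the file patterns (a compiled .so library plus an mlc-chat-config.json) are my guess at what "compiled artifacts" means for an MLC-LLM model directory.

```python
from pathlib import Path


def compiled_artifacts_present(model_dir):
    """Return True if the directory holds a compiled library and a chat config.

    Assumed layout: at least one *.so file and one mlc-chat-config.json
    somewhere under model_dir. Adjust the patterns to the real layout.
    """
    root = Path(model_dir)
    if not root.is_dir():
        return False
    has_library = any(root.rglob("*.so"))
    has_config = any(root.rglob("mlc-chat-config.json"))
    return has_library and has_config
```

An init container could call this against the mounted PVC path and exit non-zero when it returns False, so the model pod waits for compilation instead of failing a one-shot check.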
Takeaways
- Distribute models by usage pattern, not just by size. Putting "always-on" and "on-demand" models on the same GPU is fine with shared groups, but splitting across nodes gives you parallelism you can't get from time-sharing.
- On v1alpha2 Model, serviceLabels are the routing primitive. If the proxy can't find your model, check labels first. spec.litellm is for LiteLLM discovery, not the main proxy label match.
- Shared labels on models with different priorities can cause timeouts. Keep aliases on the highest-priority (always-Ready) model. Use direct model names for on-demand models.
- MLC-LLM on gfx1100 sustains ~75 TPS for 14B models at moderate context lengths. That's fast enough for copilot, chat, and code generation workloads on a homelab.
- Both 7900 XTX nodes perform similarly despite one being on a 2014 Haswell-E platform (i7-5930K) and the other on a 2023 Zen 4 (Ryzen 9 7900X3D). For these workloads, inference is GPU-bound enough that CPU generation didn't move the needle.
Related posts:
- Deploying MLC-LLM on Dual RX 7900 XTX GPUs: the VRAM, KV cache, and scheduling debugging that preceded this work.
- Running LLMs on Radeon GPUs: bottom-up ROCm setup guide.
- Hybrid GPU GitOps: the broader GitOps patterns for GPU workloads.
- Anatomy of a cold-start stall: what happens when shared-group preemption swaps into a wedged cache PVC.
Update (2026-04-18): two lanes became four, and the preemption semantics matured
The two-lane framing (quality on 7900xtx, vision+fast on 5930k, plus a media lane on 980ti) is still correct. Two months on, the serving plane grew a fourth lane and the preemption behavior picked up enough scar tissue to be worth documenting.
The fleet now
| Node | Arch | VRAM | Role | Typical workloads |
|---|---|---|---|---|
| cblevins-7900xtx | gfx1100 | 24 GB | Primary text-gen + 32K canary | Gemma 4 26B-A4B (primary), Gemma 4 26B-A4B 32K canary (on-demand swap) |
| cblevins-5930k | gfx1100 | 24 GB | Text-gen + image-gen co-tenants | OmniCoder 9B GPTQ, Qwen3.5 9B GPTQ (staged), FluxPony image generation |
| cblevins-radeonvii | gfx906 | 16 GB HBM2 | Long-context + quant work | Gemma 4 31B GPTQ, Qwen3.5 27B opus distill, SDXL inpainting |
| cblevins-gtx980ti | sm_52 | 6 GB | Embeddings + legacy | nomic-embed-text (Ollama) |
Two GPU archs (gfx1100 x2, gfx906), not one. MLC is a cross-cutting lane rather than a single-node story: the qwen3-{0.6, 4, 8, 14, 32}B-mlc family serves serverless off a shared NFS cache regardless of which physical node picks them up.
What the shared groups look like today
The single quality-textgen shared group became:
- 7900xtx-textgen (cblevins-7900xtx): two Gemma 4 26B-A4B Model CRs at priority 200 (gemma4-26b-a4b-gptq primary, gemma4-26b-a4b-gptq-long 32K canary). Priorities are equal; the canary can swap in on demand when the primary goes idle, not preempt mid-serve.
- 5930k-imagegen-textgen: image generation (FluxPony, SDXL) at priority 200, text-gen (OmniCoder, staged Qwen3.5) at priority 150. Image has preemption rights; text yields when a render request lands.
- radeonvii-models: Gemma 4 31B and SDXL inpainting at equal priority. Workload mix means these rarely contend.
The point is not the specific priorities. The point is that "two lanes" was a static allocation; "shared groups across four nodes" is a policy, and the policy has to specify what happens when two co-tenants want the same GPU at the same time.
Preemption semantics that matured
The original post described serviceLabels and priority. The enforcement got more careful:
- Loading-phase guard. The controller holds replicas at 1 whenever Model.status.phase == Loading, regardless of LastActiveTime staleness. Without this, a serverless model that took 10 min to load would get reaped mid-load because the proxy only writes LastActiveTime once at request arrival. This is the bug that made serverless oscillate between 0 and 1 replicas during a single cold start.
- Equal-priority co-tenants do not preempt each other. They swap when the active one is idle, with a cooldown to prevent thrashing. The primary/canary pair above relies on this.
- Preempt-to-ready latency is an SLI. See the update to slos-for-inference; flexinfer_model_swap_duration_seconds is the histogram and p95 > a few seconds means the "shared group" illusion is breaking.
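The first two rules can be summarized in a few lines of Python. This is a simplified sketch of the behavior described in the list above, not the controller's implementation: the function names and parameters are hypothetical, and the lower-priority branch (wait until the holder is idle) is my assumption.

```python
def desired_replicas(phase, seconds_since_last_active, idle_timeout_s, min_replicas=0):
    """Replica count for a serverless model under the loading-phase guard."""
    if phase == "Loading":
        # Guard: never reap mid-load, however stale the last-active time is.
        return max(1, min_replicas)
    if seconds_since_last_active >= idle_timeout_s:
        return min_replicas  # idle: fall back to the floor (0 = scale to zero)
    return max(1, min_replicas)


def may_take_gpu(challenger_priority, holder_priority, holder_idle,
                 seconds_since_swap, cooldown_s):
    """Whether a co-tenant may take the GPU from the current holder."""
    if challenger_priority > holder_priority:
        return True  # higher priority has preemption rights
    if challenger_priority == holder_priority:
        # Equal priority: swap only when the holder is idle and the
        # cooldown has elapsed, to avoid thrashing.
        return holder_idle and seconds_since_swap >= cooldown_s
    # Lower priority (assumption): wait until the holder is idle.
    return holder_idle
```

For the primary/canary pair at priority 200, may_take_gpu only returns True once the primary is idle and the cooldown has passed, which matches the "swap, not preempt" behavior.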
What broke in the meantime
Three failure modes worth calling out:
- Longhorn cross-node replica reads stalling mmap loads. 3-replica Longhorn cache PVC plus a swap into a cold cache plus vLLM's mmap loader equals an 8m47s stall on one safetensors shard while FlexDeck shows "Loading" the whole time. cold-start-stall-loadingsubstage walks the incident. Fix is single-replica local storage for serving-path caches plus the new Model.status.loadingSubstage fields.
- max_tokens = max_model_len means 0 prompt budget. Clients that default max_tokens to the advertised context window produce a 100% 400 rate at vLLM because the implied prompt budget is 0. The proxy now clamps max_tokens to context_window - 512 at the forward seam.
- Proxy fail-fast on stalled loads. When a cold-start wedges, the proxy was queuing fresh requests indefinitely. It now returns 503 + Retry-After after 120 s of no LoadingProgressAt advancement on a LoadingWeights substage.
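The clamp and the fail-fast rule are both small enough to sketch. The 512-token reserve and the 120 s threshold come from the list above; everything else (function names, the Retry-After value, representing timestamps as plain seconds) is a hypothetical simplification rather than the proxy's real code.

```python
PROMPT_RESERVE_TOKENS = 512  # from the clamp described above
STALL_LIMIT_S = 120          # from the fail-fast rule described above


def clamp_max_tokens(max_tokens, context_window, reserve=PROMPT_RESERVE_TOKENS):
    """Cap max_tokens so at least `reserve` tokens remain for the prompt."""
    ceiling = max(1, context_window - reserve)
    return min(max_tokens, ceiling)


def stalled_response(now_s, loading_progress_at_s, substage, retry_after_s=30):
    """Return (status, headers) if the load looks wedged, else None.

    retry_after_s is an arbitrary illustrative value.
    """
    stalled_for = now_s - loading_progress_at_s
    if substage == "LoadingWeights" and stalled_for > STALL_LIMIT_S:
        return 503, {"Retry-After": str(retry_after_s)}
    return None  # keep queuing: progress is recent or the substage differs
```

A client that sends max_tokens equal to a 32768-token window gets clamped to 32256, leaving a 512-token prompt budget instead of zero.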
What I would change about the original two-lane framing
If I were redrawing the diagram today:
- Name the groups, not the nodes. "Quality lane on 7900xtx" ties policy to hardware; "7900xtx-textgen group" separates the policy (priority, preemption) from the fleet (which node the group happens to live on). Makes node swaps cheaper when hardware changes.
- Include the MLC lane as a cross-cutting shared-cache surface, not a per-node story.
- Add the observability boxes: FlexDeck rendering phase + substage + message, proxy clamp + stall counters, per-group preemption and swap-latency panels.
The two-lane pattern is still the right starting shape. It just grows into "n lanes + policy" faster than you might expect once more than two models are serverless on the same cluster.