The LLM Compressor 0.8.0 release introduces significant enhancements to quantization workflows, extended support for Qwen3 models, and improved accuracy recovery. This release features five notable additions that we'll explore in detail.
1. Multiple modifiers during oneshot
LLM Compressor now supports using multiple modifiers within a single oneshot compression run. This capability allows practitioners to apply different modifiers to specific submodules, such as combining AWQ and GPTQ for W4A16 quantization, while only passing the calibration dataset through the model once. This feature enables enhanced support for non-uniform quantization, giving users the flexibility to account for varying sensitivity across layers and more options for post-training quantization (PTQ) experimentation.
Example: Non-uniform quantization with multiple modifiers.
from transformers import AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQMapping, AWQModifier
from llmcompressor.modifiers.quantization import GPTQModifier
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
# Configure the quantization algorithm to run.
# * quantize self_attn layers to W8A8 with GPTQ
# * quantize mlp layers to W4A16 with AWQ
# only include mappings pertaining to target layers
recipe = [
    GPTQModifier(
        targets=r"re:.*self_attn\.(k_proj|q_proj|v_proj|o_proj)$",
        scheme="W8A8",
    ),
    AWQModifier(
        targets=r"re:.*mlp\.(down_proj|gate_proj|up_proj)$",
        mappings=[
            AWQMapping("re:.*post_attention_layernorm$", ["re:.*gate_proj$", "re:.*up_proj$"]),
            AWQMapping("re:.*up_proj$", ["re:.*down_proj$"]),
        ],
        scheme="W4A16",
    ),
]

# Typical calibration settings from the LLM Compressor examples
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

oneshot(
    model=model,
    dataset="HuggingFaceH4/ultrachat_200k",
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    pipeline="sequential",
)
With the example above, users are able to apply multiple quantization schemes using both AWQ and GPTQ, producing a mixed-precision model that is directly runnable in vLLM. The resulting quantization_config in the saved checkpoint captures both schemes:
"quantization_config": {
"config_groups": {
"group_0": {
"format": "int-quantized",
"input_activations": {
"actorder": null,
"block_structure": null,
"dynamic": true,
"group_size": null,
"num_bits": 8,
"observer": null,
"observer_kwargs": {},
"strategy": "token",
"symmetric": true,
"type": "int"
},
"output_activations": null,
"targets": [
"re:.*self_attn\\.(k],
"weights": {
"actorder": null,
"block_structure": null,
"dynamic": false,
"group_size": null,
"num_bits": 8,
"observer": "minmax",
"observer_kwargs": {},
"strategy": "channel",
"symmetric": true,
"type": "int"
}
},
"group_1": {
"format": "pack-quantized",
"input_activations": null,
"output_activations": null,
"targets": [
"re:.*mlp\\.(down],
"weights": {
"actorder": null,
"block_structure": null,
"dynamic": false,
"group_size": 128,
"num_bits": 4,
"observer": "minmax",
"observer_kwargs": {},
"strategy": "group",
"symmetric": true,
"type": "int"
}
}
},
"format": "mixed-precision",
}
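Because the checkpoint is stored in the standard compressed-tensors format, it can be served directly in vLLM. The snippet below is a minimal sketch that continues from the example above; the save_compressed flag on LLM Compressor's save_pretrained wrapper and the output directory name are assumptions for illustration, not part of the release example.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Save the mixed-precision checkpoint (directory name is illustrative).
SAVE_DIR = "Meta-Llama-3-8B-Instruct-W8A8-W4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
AutoTokenizer.from_pretrained(model_id).save_pretrained(SAVE_DIR)

# Load the saved checkpoint and generate with vLLM.
llm = LLM(model=SAVE_DIR)
outputs = llm.generate(["What is non-uniform quantization?"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)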
For further details on non-uniform quantization support, see the examples in LLM Compressor.
2. Transforms update: Configurable transforms with variable rotation sizes
Transform-based modifiers (SpinQuantModifier, QuIPModifier) now support a configurable transform_block_size to further customize the Hadamard rotations applied to the model. The transform_block_size determines the size of each Hadamard, removing the requirement for full-sized rotations. This allows practitioners to align Hadamard block sizes with quantization group sizes, improving efficiency and accuracy, since smaller Hadamards cost less at runtime.
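Conceptually, a transform_block_size of 128 replaces one full-sized rotation with a block-diagonal rotation built from 128x128 Hadamards, which is what lets the rotated blocks line up with 128-wide quantization groups. The following toy sketch only illustrates that idea; it is not how LLM Compressor implements transforms internally.
import torch
from scipy.linalg import hadamard

BLOCK = 128
# Orthonormal 128x128 Hadamard: H @ H.T is the identity.
H = torch.tensor(hadamard(BLOCK), dtype=torch.float32) / BLOCK**0.5

W = torch.randn(4096, 4096)  # toy weight matrix
# Rotating each contiguous 128-wide group of input features independently is
# equivalent to multiplying by one large block-diagonal Hadamard.
W_rotated = (W.reshape(-1, BLOCK) @ H).reshape(W.shape)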
Example of using Hadamard size 128 with the QuIPModifier.
from transformers import AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform import QuIPModifier
# Select model and load it.
# NOTE: because the datafree pipeline is being used in this
# example, you can use additional GPUs to support larger models
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
# Configure the quantization algorithm to run.
# * apply quip transforms to model in order to make quantization easier
# * quantize the weights to 4 bit with a group size 128
recipe = [
    QuIPModifier(
        rotations=["v", "u"],
        transform_block_size=128,
        transform_type="hadamard",
    ),
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]
# Apply algorithms.
oneshot(model=model, recipe=recipe, pipeline="datafree")
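After oneshot returns, the transformed and quantized model is typically saved in compressed form so it can be served. A minimal sketch follows; the output directory name is illustrative, and save_compressed is assumed to be the flag on LLM Compressor's save_pretrained wrapper for writing compressed-tensors checkpoints.
from transformers import AutoTokenizer

SAVE_DIR = MODEL_ID.split("/")[-1] + "-quip-hadamard-w4a16"  # illustrative name
model.save_pretrained(SAVE_DIR, save_compressed=True)
AutoTokenizer.from_pretrained(MODEL_ID).save_pretrained(SAVE_DIR)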
The model produced by the above example will apply online rotations (that is, rotations applied at runtime). Previously, these online rotations were applied using a dense GEMM. As of the vLLM v0.11 prerelease, both full-sized and variable-sized non-random Hadamard rotations can be applied efficiently using the Hadacore kernels. With this update, we see no additional latency cost compared to the quantized counterpart with no online rotations, as shown in the following table.
Base W4A16 | Hadacore | GEMM |
0.4402 | 0.4489 | 1.2917 |
3. Transforms update: R4 support
This release extends the SpinQuant-style transforms available through the SpinQuantModifier by adding R4 support. This transform is applied to the down_proj layer, potentially improving accuracy recovery.
Example recipe applying R4 transforms, along with the already supported R1 and R2.
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform import SpinQuantModifier
recipe = [
    SpinQuantModifier(
        rotations=["R1", "R2", "R4"],
        transform_block_size=128,
        transform_type="hadamard",
    ),
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]
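The recipe can then be applied in the same way as the QuIPModifier example above. A minimal sketch, assuming the same Llama 3.1 8B checkpoint; like the QuIP transforms, these Hadamard rotations need no calibration data, so the data-free pipeline applies.
from transformers import AutoModelForCausalLM
from llmcompressor import oneshot

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # assumed checkpoint for illustration
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

# Apply the SpinQuant-style transforms followed by W4A16 quantization.
oneshot(model=model, recipe=recipe, pipeline="datafree")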
4. Quantization support for Qwen3 models
LLM Compressor 0.8.0 has added support for Qwen3-Next and Qwen3-VL MoE models.
For the Qwen3-Next model, the Qwen3NextSparseMoeBlock is temporarily updated to ensure that all experts see data during oneshot, allowing all quantization scales to be properly calibrated while preserving gated activations for accuracy. Further details can be found in the NVFP4 and FP8 examples.
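As a rough sketch of what a data-free FP8 run on Qwen3-Next can look like (the checkpoint name and ignore list below are assumptions for illustration; consult the linked NVFP4 and FP8 examples for the exact recipes):
from transformers import AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-Next-80B-A3B-Instruct"  # assumed checkpoint for illustration
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")

# FP8 dynamic quantization is data-free: per-channel weight scales,
# per-token activation scales computed at runtime.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:.*mlp\\.gate$"],  # illustrative: keep the router gates unquantized
)

oneshot(model=model, recipe=recipe)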
This release also adds FP8 quantization support for Qwen3 Vision-Language MoE models. LLM Compressor updates the model's Qwen3VLMoeTextSparseMoeBlock blocks with linearized MoE layers that can be quantized and are runnable in vLLM. See the example for further details.
You can see the FP8 block-quantized model on the RedHatAI Hub with full model evaluations on the OpenLLM V1 metrics, where the model achieves an average recovery score of over 99%. Support for calibration pathways requiring data will be added shortly for this model.
5. Improved accuracy for GPTQ W4A16 schemes
The GPTQModifier now defaults to "weight" activation ordering for W4A16 quantization. Weight-based activation ordering has been shown to improve accuracy recovery by up to two points without introducing additional runtime cost. Benchmarks are available in vllm/pull/8135.
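The new default applies automatically, but the ordering can also be set explicitly on the modifier to pin or override the behavior. A minimal sketch; the target and ignore settings are illustrative.
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    actorder="weight",  # now the default for W4A16; "group" is another supported ordering
    ignore=["lm_head"],
)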
Example model config with the default weight activation ordering in the "actorder" field:
"weights": {
"actorder": "weight",
"block_structure": null,
"dynamic": false,
"group_size": 128,
"num_bits": 4,
"observer": "minmax",
"observer_kwargs": {},
"strategy": "group",
"symmetric": true,
"type": "int"
}
Conclusion
The LLM Compressor 0.8.0 release brings substantial advancements to quantization, including support for multiple modifiers during oneshot, configurable transforms with variable rotation sizes, R4 support for SpinQuant-style transforms, and extended quantization support for Qwen3-Next and Qwen3-VL MoE models. These updates, along with improved accuracy for GPTQ W4A16 schemes, enhance the flexibility, efficiency, and accuracy of LLM compression workflows, paving the way for more optimized and performant models.
Explore the latest models, recipes, and examples in the LLM Compressor repository, or experiment with quantization techniques to tailor performance to your needs.