Transfer Oracle

Benchmark Results

Your quantized model passes every benchmark.
Here's what it's hiding.

We quantized ViT-B/16 (86.6M params) at four precision levels and ran each through standard ML metrics and proprietary structural analysis on 5,000 samples. Standard metrics say int4 is “fine” (cosine 0.988, kNN -0.5%). Our analysis reveals per-class blind spots and ranking degradation that perplexity and cosine similarity miss entirely.

Cosine says “safe.” Your search results disagree.

Every class passes the cosine similarity test. But look at what happens to actual retrieval neighbors. The left side is what your monitoring dashboard shows. The right side is reality.

Your users got different search results

Pick a sample. See exactly which neighbors changed after int4 quantization. These are real embedding comparisons — not synthetic examples.

The damage averages hide

Average cosine is 0.988. But zoom into the worst 10% of samples and you see which classes are disproportionately damaged: cat and bird absorb most of the cosine damage, while truck and automobile are barely touched.

Standard ML metrics

Industry-standard comparison metrics. All vs float32 baseline. ViT-B/16 on CIFAR-10, 5,000 samples.

| Variant | Cosine Sim | kNN Acc | Spearman ρ | SQNR (dB) |
|---|---|---|---|---|
| Float32 (baseline) | 1.000 | 94.4% | 1.000 | 120.0 |
| Float16 | 1.000 | 94.4% | 1.000 | 54.8 |
| Int8 (bitsandbytes) | 0.996 | 94.3% | 0.996 | 21.3 |
| Int4 NF4 | 0.988 | 93.9% | 0.989 | 16.1 |

Cosine similarity = per-sample directional similarity. kNN accuracy = classification from embedding neighbors (k=5). Spearman ρ = rank-order correlation of pairwise distances. SQNR = signal-to-quantization-noise ratio.
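All four standard metrics can be reproduced in a few lines of numpy/scipy. A minimal sketch on synthetic data (the random arrays stand in for paired fp32/quantized embeddings; sizes and noise scale are illustrative assumptions, not the benchmark's actual data):

```python
import numpy as np
from scipy.stats import spearmanr
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
base = rng.standard_normal((200, 32)).astype(np.float32)   # stand-in fp32 embeddings
quant = base + 0.01 * rng.standard_normal((200, 32)).astype(np.float32)  # simulated quantization noise
labels = rng.integers(0, 10, size=200)

# Cosine similarity: per-sample directional agreement, averaged
cos = np.mean(np.sum(base * quant, axis=1) /
              (np.linalg.norm(base, axis=1) * np.linalg.norm(quant, axis=1)))

# Spearman rho: rank-order correlation of all pairwise distances
rho = spearmanr(pdist(base), pdist(quant))[0]

# SQNR in dB: signal power over quantization-noise power
sqnr_db = 10 * np.log10(np.sum(base ** 2) / np.sum((quant - base) ** 2))

def knn_accuracy(emb, labels, k=5):
    """Leave-one-out k-NN classification from embedding neighbors."""
    d = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # never vote for yourself
    nn = np.argsort(d, axis=1)[:, :k]
    pred = np.array([np.bincount(labels[row]).argmax() for row in nn])
    return float(np.mean(pred == labels))
```

With noise this small, cosine and Spearman sit near 1.0 while SQNR lands around 40 dB, mirroring the pattern in the table above.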

What standard metrics miss

Transfer Oracle's proprietary structural analysis goes beyond standard metrics. Where cosine says “fine,” structural analysis reveals hidden damage.

| Variant | Structural Prediction | Distribution Coverage | Structural Integrity | Transfer Risk |
|---|---|---|---|---|
| Float32 (baseline) | 100.0% | 100.0% | 1.000 | 0.00 |
| Float16 | 100.0% | 100.0% | 1.000 | 0.00 |
| Int8 (bitsandbytes) | 98.3% | 98.0% | 0.991 | 0.02 |
| Int4 NF4 | 97.4% | 91.0% | 0.984 | 0.03 |

Structural Prediction

How accurately the quantized model's internal structure predicts correct behavior. 100% = structurally identical.

Distribution Coverage

Fraction of training distribution still reachable. Lower coverage means the model lost access to learned regions.

Structural Integrity

Composite score from multiple independent structural analyses. 1.0 = perfect preservation. Detects damage invisible to cosine similarity.

Transfer Risk

Overall deployment risk. 0 = safe to deploy, 1 = do not deploy. Combines all structural signals into a single go/no-go metric.

Feature importance spectrum

How information is distributed across representation dimensions. Healthy models spread information broadly. Collapsed models concentrate it.

The quantization gradient

Float16

1.000

Cosine similarity

Lossless. 2x smaller. Spearman ρ 0.99999. No reason not to use it.

Int8

0.996

Cosine similarity

Near-lossless for classification. Spearman ρ 0.996 — ranking well preserved.

Int4 NF4

0.988

Cosine similarity

Looks fine. But Spearman drops to 0.989 — ~1% of retrieval rankings shuffled. Per-class blind spots emerge.

More Formats

Transfer Oracle also supports ternary (BitNet), GGUF, GPTQ, AWQ, and other quantization formats. Any format that produces embeddings can be audited.

Working with ternary quantization or planning a BitNet deployment? Contact us for a pilot program.

Int4 has blind spots

Overall kNN accuracy drops just 0.5%. But per-class analysis reveals cat (class 3) loses 1.4% and drops to 85.9% kNN — the damage is non-uniform. Average metrics hide class-specific damage.

| Class | Float32 | Int4 NF4 | Delta | Impact |
|---|---|---|---|---|
| airplane | 95.0% | 95.0% | -- | No degradation |
| automobile | 93.2% | 93.2% | -- | No degradation |
| bird | 92.0% | 92.0% | -- | No degradation |
| cat | 87.3% | 85.9% | -1.4% | Degraded |
| deer | 93.5% | 93.5% | -- | No degradation |
| dog | 91.8% | 92.0% | +0.2% | No degradation |
| frog | 97.8% | 97.8% | -- | No degradation |
| horse | 93.5% | 93.5% | -- | No degradation |
| ship | 96.8% | 96.8% | -- | No degradation |
| truck | 98.2% | 98.2% | -- | No degradation |

Why these classes? Cat requires fine-grained feature discrimination against similar classes (dog, deer). Int4 precision loss degrades these subtle distinctions. Classes with highly distinctive shapes (frog, truck) survive quantization intact.
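The per-class breakdown itself is cheap to compute once you have both embedding sets. A sketch on synthetic, well-separated clusters (class count, cluster separation, and noise level are illustrative assumptions):

```python
import numpy as np

def per_class_knn_accuracy(emb, labels, k=5):
    """Leave-one-out k-NN accuracy, broken down per class.
    Averages hide class-specific damage; this surfaces it."""
    d = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]
    pred = np.array([np.bincount(labels[row]).argmax() for row in nn])
    correct = pred == labels
    return {int(c): float(correct[labels == c].mean()) for c in np.unique(labels)}

rng = np.random.default_rng(1)
labels = rng.integers(0, 3, size=150)
# One-hot centroids scaled by 6.0 give well-separated class clusters.
base = 6.0 * np.eye(16)[labels] + rng.standard_normal((150, 16))
quant = base + 0.5 * rng.standard_normal((150, 16))  # simulated quantization noise

acc_base = per_class_knn_accuracy(base, labels)
acc_quant = per_class_knn_accuracy(quant, labels)
deltas = {c: acc_quant[c] - acc_base[c] for c in acc_base}
```

Subtracting the two dicts gives exactly the Delta column above; a single class with a large negative delta is the blind-spot signature.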

Per-class similarity map

Cosine similarity vs float32, broken down by class. Green = preserved, red = destroyed. Notice how int4 degrades selectively across classes — some stay intact while others lose structure.

Distribution of per-sample similarity

The mean hides the spread. Int4 has a wider tail of damaged samples than int8. Int8 stays tightly correlated. Int4 shows subtle spread — the structural damage that accuracy alone misses.
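A sketch of this tail analysis on synthetic data (the 10% heavy-noise fraction and noise scales are illustrative assumptions, not measured values): the mean and median look healthy while the 1st percentile exposes the damaged tail.

```python
import numpy as np

rng = np.random.default_rng(2)
base = rng.standard_normal((5000, 64)).astype(np.float32)
noise = rng.standard_normal((5000, 64)).astype(np.float32)
# Heavier noise on a random ~10% of samples simulates a damaged tail.
scale = np.where(rng.random(5000) < 0.10, 0.6, 0.05).astype(np.float32)
quant = base + scale[:, None] * noise

# Per-sample cosine similarity between original and perturbed embeddings
cos = np.sum(base * quant, axis=1) / (
    np.linalg.norm(base, axis=1) * np.linalg.norm(quant, axis=1))

mean, median, p1 = cos.mean(), np.percentile(cos, 50), np.percentile(cos, 1)
```

The gap between `mean` and `p1` is the spread the dashboard average hides.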

Embedding space projection

Same 500 samples projected into 2D. Click each variant to see how class clusters deform. Float32 → Int8: clusters hold. Int4: subtle deformation visible in class boundaries.

Multi-metric profile

All 8 metrics on one chart. Float32 is a perfect octagon. Lower quantization levels show which axes degrade first. Int4 shows selective damage — geometry and kNN shrink while cosine stays high.

Do metrics agree on Int4?

Six metrics say “safe.” Two say “damaged.” This is why single-metric evaluation is dangerous.

Theoretical noise vs actual damage

SQNR (signal-to-quantization-noise ratio), borrowed from signal processing theory, predicts the degradation gradient across the practical metrics: as SQNR falls from 54.8 dB (float16) to 21.3 dB (int8) to 16.1 dB (int4), cosine, kNN, and Spearman degrade in the same order.
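Each bit of uniform quantization buys roughly 6 dB of SQNR, which is why the dB column tracks bit width. A sketch using a plain symmetric uniform quantizer as a stand-in (an assumption: NF4's levels are non-uniform, so its measured SQNR differs from this toy version):

```python
import numpy as np

def quantize_uniform(x, bits):
    """Symmetric uniform quantizer scaled to the data range
    (a simplification of real schemes like NF4)."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

def sqnr_db(x, xq):
    """Signal power over quantization-noise power, in dB."""
    return 10 * np.log10(np.sum(x ** 2) / np.sum((x - xq) ** 2))

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000).astype(np.float32)  # Gaussian-ish weights
sqnr8 = sqnr_db(x, quantize_uniform(x, 8))
sqnr4 = sqnr_db(x, quantize_uniform(x, 4))
```

Dropping from 8 to 4 bits costs roughly 4 × 6 ≈ 24 dB here, the same order of gap as the measured 21.3 dB → 16.1 dB... adjusted for NF4's non-uniform levels.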

Four metrics that disagree

At Int4 NF4, different metrics tell different stories. Which one do you trust?

0.988

Cosine Similarity

“Embeddings are 98.8% similar. Ship it.”

-0.5%

kNN Accuracy

“Classification barely moved. Safe.”

0.989

Spearman ρ

“1.1% of pairwise rankings shuffled. Retrieval may be affected.”

16.1 dB

SQNR

“High quantization noise floor. Theoretical damage is real.”

A single metric is insufficient. Transfer Oracle runs multiple independent analyses spanning standard metrics (cosine, kNN, rank preservation, SQNR) and proprietary structural methods — including per-class breakdown, distribution coverage, and multi-dimensional integrity checks.

Can LoRA recover quantization damage?

We trained LoRA adapters on each quantized base to test if fine-tuning can compensate for precision loss. The answer is nuanced.

| Config | kNN Acc | Cosine vs FP32 | Structural Prediction | Integrity | Verdict |
|---|---|---|---|---|---|
| FP32 (no LoRA) | 89.0% | 1.000 | -- | -- | baseline |
| LoRA + FP32 | 89.4% | 0.486 | 90.6% | 25% | accuracy up, structure changed |
| LoRA + Int8 | 89.2% | 0.981 | 88.0% | 79.4% | best: preserves accuracy AND geometry |

The LoRA paradox

LoRA + FP32 lifts kNN accuracy to 89.4% but breaks the embedding geometry: only 25% structural integrity, and cosine vs the fp32 base drops to 0.486. LoRA reorganized the entire representation. Accuracy recovered; structure didn't.

The recommendation

LoRA + Int8 is the sweet spot. 89.2% accuracy, 0.981 cosine, 79.4% structural integrity. The only configuration that preserves both accuracy AND embedding geometry.

LoRA adapters don't survive quantization

Can you train a LoRA on float32 and deploy it on an int8-quantized model? No. The adapter is base-specific.

| Config | kNN Acc | Structural Prediction | Integrity |
|---|---|---|---|
| Int8 + Int8-LoRA (native) | 89.2% | 88.0% | 79.4% |
| Int8 + FP32-LoRA (cross-base) | 63.4% | 75.2% | 21.4% |

Int8 cross-base: 63.4% vs 89.2% native (-25.8%). The fp32-trained LoRA completely fails on the int8 base. Even though raw int8 embeddings have 0.996 cosine similarity to fp32, the LoRA adaptation operates in a different subspace after quantization. You must retrain the adapter on the actual quantized base.
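A toy linear analogue illustrates the mechanism. This is not actual LoRA: the "adapter" here is a full least-squares correction (no low-rank constraint) and the 4-bit-style grid is a crude stand-in for NF4. But the failure mode is the same: a correction fitted against the fp32 weights cannot cancel the quantized base's error.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 32
W = rng.standard_normal((d, d)) / np.sqrt(d)    # stand-in fp32 weight matrix
step = np.abs(W).max() / 7                      # crude 4-bit-style grid
Wq = np.round(W / step) * step                  # quantized base

X = rng.standard_normal((256, d))
Y = X @ (W + 0.1 * rng.standard_normal((d, d)))  # downstream task targets

def train_delta(base):
    """'Adapter': least-squares correction fitted against a given base."""
    D, *_ = np.linalg.lstsq(X, Y - X @ base, rcond=None)
    return D

def rel_err(base, D):
    return np.linalg.norm(X @ (base + D) - Y) / np.linalg.norm(Y)

native = rel_err(Wq, train_delta(Wq))  # adapter trained on the quantized base
cross = rel_err(Wq, train_delta(W))    # fp32-trained adapter moved cross-base
```

The native adapter absorbs the quantization error during training; the cross-base adapter leaves that error uncorrected, so its residual is strictly worse.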

Methodology

Model: ViT-B/16 pretrained on ImageNet (86.6M params, 768-dim embeddings)

Dataset: CIFAR-10 (10 classes, 5,000 samples)

Quantization variants:

  • Float32 — full precision baseline
  • Float16 — half precision
  • Int8 (bitsandbytes) — LLM.int8() applied to all Linear layers
  • Int4 NF4 (bitsandbytes) — NormalFloat4 applied to all Linear layers
  • Additional formats supported: ternary (BitNet), GGUF, GPTQ, AWQ — any format that produces embeddings. Contact us for a ternary pilot.
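The four benchmarked variants can be loaded with bitsandbytes through Hugging Face transformers. A sketch under assumptions: the checkpoint id below is a guess (the exact ImageNet checkpoint isn't specified here), and bitsandbytes support for a given architecture depends on your transformers version.

```python
# Sketch: loading a ViT-B/16 backbone at each precision via transformers.
# "google/vit-base-patch16-224" is an assumed checkpoint id.
import torch
from transformers import ViTModel, BitsAndBytesConfig

CKPT = "google/vit-base-patch16-224"

fp32 = ViTModel.from_pretrained(CKPT)                              # baseline
fp16 = ViTModel.from_pretrained(CKPT, torch_dtype=torch.float16)   # half precision

int8_cfg = BitsAndBytesConfig(load_in_8bit=True)                   # LLM.int8()
nf4_cfg = BitsAndBytesConfig(load_in_4bit=True,
                             bnb_4bit_quant_type="nf4")            # NormalFloat4

int8 = ViTModel.from_pretrained(CKPT, quantization_config=int8_cfg)
int4 = ViTModel.from_pretrained(CKPT, quantization_config=nf4_cfg)
```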

LoRA: rank=8, alpha=16, 5 epochs (peft library)
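The stated LoRA setup maps directly onto a peft config. A sketch; `target_modules` is an assumption (typical ViT attention projection names) and must match the actual module names in the loaded model.

```python
from peft import LoraConfig, get_peft_model

# rank and alpha from the methodology above; target_modules is assumed
lora_cfg = LoraConfig(r=8, lora_alpha=16,
                      target_modules=["query", "value"])

# peft_model = get_peft_model(backbone, lora_cfg)  # backbone: the (quantized) ViT
```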

Standard metrics:

  • Cosine similarity — per-sample embedding directional similarity
  • kNN accuracy (k=5) — classification quality from embedding neighbors
  • Spearman rank correlation — pairwise distance rank preservation
  • SQNR — signal-to-quantization-noise ratio (dB)
  • Per-class breakdown — class-specific degradation detection

Proprietary analysis (Transfer Oracle):

  • Structural prediction — proprietary structural analysis
  • Distribution coverage — training region reachability assessment
  • Transfer risk — composite structural integrity score
  • Anomaly scoring — per-sample vulnerability detection
  • + additional proprietary analyses

Audit your quantized model

Don't deploy quantized models blind. Know exactly which classes survived, which collapsed, and whether your LoRA adapter will transfer.