Abstract
Sparse Autoencoders (SAEs) are increasingly proposed as interpretable safety monitors for large language models. But do their features actually help detect jailbreaks? We introduce SAEGuardBench, a benchmark comparing 8 detection methods across 4 paradigms on 6 datasets and 4 models (2B–70B parameters). The answer is no. SAE features consistently hurt detection compared to simple linear probes on raw activations, a gap we call the Detection Gap, which is negative on every model we test. The gap persists across layers, transfer settings, wider SAEs, and nonlinear classifiers. We trace the cause to the reconstruction objective, which discards low-variance directions carrying safety signal. Yet SAE features still capture interpretable concept structure that raw activations lack. To exploit both strengths, we describe InterpGuard, a practical two-stage recipe that detects with raw activations and explains with SAE features.
Key Results
SAE features underperform raw activations on every model tested. The Detection Gap (SAE AUROC minus raw AUROC on the same held-out data) is consistently negative: SAE-based probes always lose to probes on raw residual-stream activations.
| Model | Params | Raw AUROC | SAE AUROC | Detection Gap |
|---|---|---|---|---|
| Gemma-2-2B | 2B | 0.949 | 0.712 | −0.237 |
| Gemma-3-4B | 4B | 0.922 | 0.709 | −0.213 |
| Llama-3.1-8B | 8B | 0.867 | 0.477 | −0.391 |
| Llama-3.3-70B | 70B | 1.000 | 0.949 | −0.051 |
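The Detection Gap column is simply the SAE probe's AUROC minus the raw probe's AUROC on the same held-out split. A minimal sketch of computing such a gap (function and variable names are hypothetical; the benchmark's actual probe training is not shown):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def detection_gap(X_raw, X_sae, y, X_raw_test, X_sae_test, y_test):
    """AUROC(SAE probe) - AUROC(raw probe); negative means SAE features hurt.

    X_raw / X_sae: raw residual-stream activations and SAE features
    for the same prompts; y: 1 = jailbreak, 0 = benign.
    """
    raw_probe = LogisticRegression(max_iter=1000).fit(X_raw, y)
    sae_probe = LogisticRegression(max_iter=1000).fit(X_sae, y)
    raw_auroc = roc_auc_score(y_test, raw_probe.decision_function(X_raw_test))
    sae_auroc = roc_auc_score(y_test, sae_probe.decision_function(X_sae_test))
    return sae_auroc - raw_auroc
```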
Framework
Raw probes detect well but cannot explain why a prompt is flagged. SAE features explain well but detect poorly. InterpGuard combines both in a practical two-stage pipeline.
1. Detect: run a linear probe on raw residual-stream activations. Fast, lightweight, and highly accurate; no SAE is needed at inference time for the detection decision.
2. Explain: for flagged prompts only, project activations through the SAE and retrieve the top-activating features, mapped to human-readable concepts via Neuronpedia labels.
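The two stages can be sketched as follows; all weights, labels, and names here are hypothetical stand-ins, assuming a trained linear probe, a trained SAE encoder, and Neuronpedia-style feature labels:

```python
import numpy as np

def interpguard(act, probe_w, probe_b, sae_W_enc, sae_b_enc,
                feature_labels, threshold=0.0, top_k=5):
    """Two-stage sketch: detect with a raw-activation probe, then
    (only if flagged) explain with top-activating SAE features."""
    # Stage 1: detection decision from the raw residual-stream activation.
    score = float(act @ probe_w + probe_b)
    if score <= threshold:
        return {"flagged": False, "score": score}
    # Stage 2: encode through the SAE only for flagged prompts,
    # then surface the top-activating features with their labels.
    feats = np.maximum(act @ sae_W_enc + sae_b_enc, 0.0)  # ReLU encoder
    top = np.argsort(feats)[::-1][:top_k]
    explanation = [(int(i), feature_labels.get(int(i), "(unlabeled)"), float(feats[i]))
                   for i in top if feats[i] > 0]
    return {"flagged": True, "score": score, "explanation": explanation}
```

Note that the SAE encoder is touched only on the flagged path, so the common-case detection cost is a single dot product.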
The reconstruction objective in SAEs is optimized for explaining variance, not preserving safety-relevant directions. Low-variance but high-discrimination directions (the ones that separate safe from unsafe) get discarded. InterpGuard sidesteps this by never relying on SAE features for the detection decision itself.
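A toy illustration of this failure mode, using a rank-limited PCA reconstruction as a stand-in for the SAE's reconstruction bottleneck (a simplification for intuition, not the paper's actual setup): a direction that separates the classes almost perfectly is discarded because it contributes little variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, n)
# Nine high-variance nuisance directions carry no label signal...
nuisance = rng.normal(0.0, 5.0, (n, 9))
# ...one low-variance direction carries nearly all of it.
safety = (y + rng.normal(0.0, 0.3, n))[:, None]
X = np.hstack([nuisance, safety])
Xc = X - X.mean(axis=0)

# Rank-9 reconstruction keeps only the top-variance axes,
# which here are the nuisance axes: the safety axis is dropped.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_rec = Xc @ Vt[:9].T @ Vt[:9]

def auroc(score, labels):
    """Rank-based AUROC (Mann-Whitney U)."""
    ranks = np.empty(len(score))
    ranks[np.argsort(score)] = np.arange(1, len(score) + 1)
    npos = int(labels.sum())
    nneg = len(labels) - npos
    return (ranks[labels == 1].sum() - npos * (npos + 1) / 2) / (npos * nneg)

w = np.zeros(10)
w[-1] = 1.0  # probe along the safety direction
print(auroc(Xc @ w, y))     # near-perfect before reconstruction
print(auroc(X_rec @ w, y))  # near chance after reconstruction
```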
Visualizations
Figure 1
Detection performance comparison between raw activation probes and SAE feature probes across all four models. The Detection Gap is visible on every model.
Figure 2
ROC curves comparing all 8 detection methods. Raw probes (solid lines) consistently dominate SAE-based methods (dashed lines) across operating points.
Figure 3
Hybrid recovery analysis showing that concatenating SAE features with raw activations recovers 88–106% of raw probe performance across all models.
Figure 4
PCA of the safety subspace. PC2 carries 13% variance in the reconstruction but 58% in the residual, explaining why the reconstruction objective discards safety signal.
Citation
If you find this work useful, please cite: