Do SAE Features Actually Help Detect Jailbreaks?

A Systematic Benchmark of Interpretability-Based Safety Methods

Md A Rahman · Texas Tech University

What We Found

Sparse Autoencoders (SAEs) are increasingly proposed as interpretable safety monitors for large language models. But do their features actually help detect jailbreaks? We introduce SAEGuardBench, a benchmark comparing 8 detection methods across 4 paradigms on 6 datasets and 4 models (2B–70B parameters). The answer is no. SAE features consistently hurt detection compared to simple linear probes on raw activations, a gap we call the Detection Gap, which is negative on every model we test. The gap persists across layers, transfer settings, wider SAEs, and nonlinear classifiers. We trace the cause to the reconstruction objective, which discards low-variance directions carrying safety signal. Yet SAE features still capture interpretable concept structure that raw activations lack. To exploit both strengths, we describe InterpGuard, a practical two-stage recipe that detects with raw activations and explains with SAE features.

The Detection Gap

SAE features underperform raw activations on every model tested. The Detection Gap is consistently negative, meaning SAE-based probes always lose to probes on raw residual-stream activations.

Model           Params   Raw AUROC   SAE AUROC   Detection Gap
Gemma-2-2B      2B       0.949       0.712       −0.237
Llama-3.1-8B    8B       0.867       0.477       −0.391
Gemma-3-4B      4B       0.922       0.709       −0.213
Llama-3.3-70B   70B      1.000       0.949       −0.051

InterpGuard: Best of Both Worlds

Raw probes detect well but cannot explain why a prompt is flagged. SAE features explain well but detect poorly. InterpGuard combines both in a practical two-stage pipeline.

1 Detect

Run a linear probe on raw residual-stream activations. Fast, lightweight, and highly accurate. No SAE needed at inference time for the detection decision.

0.957 AUROC on held-out jailbreak datasets
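The detect stage can be sketched as a plain logistic-regression probe. Everything below is illustrative: the dimensions are arbitrary and synthetic data stands in for real residual-stream activations, which would come from a forward pass through the monitored model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-in for residual-stream activations: a single direction
# separates safe (y=0) from jailbreak (y=1) prompts.
d, n = 64, 400                     # hidden size and sample count (illustrative)
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d))
X[:, 0] += 1.5 * y                 # shift along a toy "safety" direction

# Stage 1 of InterpGuard: a linear probe on raw activations, no SAE involved.
probe = LogisticRegression(max_iter=1000).fit(X[:300], y[:300])
scores = probe.predict_proba(X[300:])[:, 1]
print(f"held-out AUROC: {roc_auc_score(y[300:], scores):.3f}")
```

At inference time the detection decision is just a dot product with the learned probe weights plus a threshold, which is why the stage adds negligible latency.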

2 Explain

For flagged prompts only, project activations through the SAE and retrieve top-activating features. Map to human-readable concepts via Neuronpedia labels.

98% of harmful samples have safety concepts in top-10 features
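The explain stage can be sketched as follows; the SAE encoder here uses random weights purely for illustration, where a real deployment would load trained weights (e.g. a published SAE suite such as Gemma Scope) and map the returned feature indices to Neuronpedia labels.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 64, 512     # hidden size and SAE dictionary size (both illustrative)

# Toy SAE encoder: f(x) = ReLU(W_enc x + b_enc). Random weights stand in
# for a trained autoencoder's encoder matrix and bias.
W_enc = rng.normal(size=(m, d)) / np.sqrt(d)
b_enc = np.zeros(m)

def explain(x, k=10):
    """Stage 2: run only on prompts the raw probe flagged. Returns the
    indices of the k strongest SAE features; each index would then be
    looked up (e.g. on Neuronpedia) for a human-readable concept label."""
    acts = np.maximum(W_enc @ x + b_enc, 0.0)   # ReLU sparse code
    return np.argsort(acts)[::-1][:k].tolist()

flagged_activation = rng.normal(size=d)   # stand-in for a flagged prompt
print(explain(flagged_activation))
```

Because this stage runs only on flagged prompts, the SAE forward pass is off the hot path and its cost is amortized over the (rare) positive detections.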

Key Insight

The reconstruction objective in SAEs is optimized for explaining variance, not preserving safety-relevant directions. Low-variance but high-discrimination directions (the ones that separate safe from unsafe) get discarded. InterpGuard sidesteps this by never relying on SAE features for the detection decision itself.
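A minimal numeric illustration of this failure mode, with rank-1 PCA standing in for any variance-driven reconstruction objective (the data and dimensions are invented for the toy):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n = 1000
y = rng.integers(0, 2, size=n)

# Dim 0: high variance, no label signal. Dim 1: low variance, but it
# carries the entire safe/unsafe separation (a toy safety direction).
X = np.stack([3.0 * rng.normal(size=n),
              0.3 * rng.normal(size=n) + 0.5 * y], axis=1)

# A rank-1 reconstruction keeps the high-variance dim 0 and throws away
# the low-variance but discriminative dim 1.
pca = PCA(n_components=1).fit(X)
recon = pca.inverse_transform(pca.transform(X))

auc_raw = roc_auc_score(y, X[:, 1])
auc_recon = roc_auc_score(y, recon[:, 1])
print(f"raw AUROC: {auc_raw:.3f}  reconstructed AUROC: {auc_recon:.3f}")
```

The raw coordinate separates the classes well, while the reconstructed one scores near chance: reconstruction error on dim 1 is tiny in variance terms, yet the detection signal is gone.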

Key Figures

Raw vs SAE detection comparison across models

Figure 1

Detection performance comparison between raw activation probes and SAE feature probes across all four models. The Detection Gap is visible on every model.

ROC curves for detection methods

Figure 2

ROC curves comparing all 8 detection methods. Raw probes (solid lines) consistently dominate SAE-based methods (dashed lines) across operating points.

Hybrid recovery rates

Figure 3

Hybrid recovery analysis showing that concatenating SAE features with raw activations recovers 88–106% of raw probe performance across all models.

Safety subspace PCA analysis

Figure 4

PCA of the safety subspace. PC2 carries 13% variance in the reconstruction but 58% in the residual, explaining why the reconstruction objective discards safety signal.

BibTeX

If you find this work useful, please cite:

@article{rahman2026saeguardbench,
  title={Do SAE Features Actually Help Detect Jailbreaks? A Systematic Benchmark of Interpretability-Based Safety Methods},
  author={Rahman, Md A},
  year={2026},
  note={Preprint}
}