AxBench by pyvene

Zhengxuan Wu*, Aryaman Arora*, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D. Manning, Christopher Potts (* equal contribution)

A scalable benchmark for evaluating interpretability techniques on concept detection and model steering.

🔍 Concept detection leaderboard
Metric: AUC-ROC (higher is better)  ·  Evaluated on 500 concepts
| # | Method | Gemma-2-2B L10 | Gemma-2-2B L20 | Gemma-2-9B L20 | Gemma-2-9B L31 | Avg |
|---|--------|----------------|----------------|----------------|----------------|-----|
| 1 | DiffMean | 0.948 | 0.946 | 0.955 | 0.921 | 0.942 |
| 2 | Probe | 0.940 | 0.946 | 0.933 | 0.942 | 0.940 |
| 3 | ReFT-r1 | 0.952 | 0.965 | 0.966 | 0.869 | 0.938 |
| 4 | Prompt | 0.910 | 0.921 | 0.940 | 0.943 | 0.929 |
| 5 | SAE-A | 0.924 | 0.911 | 0.924 | 0.907 | 0.917 |
| 6 | BoW | 0.909 | 0.931 | 0.904 | 0.912 | 0.914 |
| 7 | SSV | 0.934 | 0.950 | 0.910 | 0.854 | 0.912 |
| 8 | LAT | 0.742 | 0.809 | 0.572 | 0.725 | 0.712 |
| 9 | SAE | 0.735 | 0.755 | 0.631 | 0.659 | 0.695 |
| 10 | PCA | 0.714 | 0.712 | 0.559 | 0.622 | 0.652 |
| 11 | IG | 0.440 | 0.375 | 0.508 | 0.383 | 0.426 |
| 12 | IxG | 0.243 | 0.217 | 0.193 | 0.330 | 0.246 |
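
As a reference for the metric, here is a minimal sketch of how the top-ranked DiffMean baseline can be scored: the concept direction is the difference between mean activations on concept-positive and concept-negative training examples, and detection quality is the ROC AUC of held-out activations projected onto that direction. The synthetic Gaussian activations below stand in for model hidden states; the benchmark's own harness differs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Stand-ins for residual-stream activations (one vector per example).
# In the benchmark these would come from a Gemma-2 layer; here they are synthetic.
d = 64
pos_train = rng.normal(0.5, 1.0, size=(200, d))  # concept-positive examples
neg_train = rng.normal(0.0, 1.0, size=(200, d))  # concept-negative examples
pos_test = rng.normal(0.5, 1.0, size=(100, d))
neg_test = rng.normal(0.0, 1.0, size=(100, d))

# DiffMean: the concept direction is the difference of class means.
direction = pos_train.mean(axis=0) - neg_train.mean(axis=0)
direction /= np.linalg.norm(direction)

# The detection score for an example is its projection onto the direction.
scores = np.concatenate([pos_test @ direction, neg_test @ direction])
labels = np.concatenate([np.ones(len(pos_test)), np.zeros(len(neg_test))])

print(f"ROC AUC: {roc_auc_score(labels, scores):.3f}")
```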

📢 Open a pull request to add your method to the leaderboard.

🏆 Rank-1 steering leaderboard
Metric: steering score on a 0–2 scale, the harmonic mean of LLM-judged concept, instruction, and fluency ratings (higher is better)  ·  Evaluated on 500 concepts
| # | Method | Gemma-2-2B L10 | Gemma-2-2B L20 | Gemma-2-9B L20 | Gemma-2-9B L31 | Avg |
|---|--------|----------------|----------------|----------------|----------------|-----|
| 1 | HyperSteer [Sun et al., 2025] | – | 0.742 | 1.091 | – | 0.917 |
| 2 | Prompt | 0.698 | 0.731 | 1.075 | 1.072 | 0.894 |
| 3 | RePS [Wu et al., 2025] | 0.756 | 0.606 | 0.892 | 0.624 | 0.720 |
| 4 | ReFT-r1 | 0.633 | 0.509 | 0.630 | 0.401 | 0.543 |
| 5 | SAE (filtered) [Arad et al., 2025] | – | 0.546 | 0.470 | – | 0.508 |
| 6 | SAELogits [Gerlach, 2026] | – | 0.351 | – | – | 0.351 |
| 7 | DiffMean | 0.297 | 0.178 | 0.322 | 0.158 | 0.239 |
| 8 | SAE | 0.177 | 0.151 | 0.191 | 0.140 | 0.165 |
| 9 | SAE-A | 0.166 | 0.132 | 0.186 | 0.143 | 0.157 |
| 10 | LAT | 0.117 | 0.130 | 0.127 | 0.134 | 0.127 |
| 11 | PCA | 0.107 | 0.083 | 0.128 | 0.104 | 0.105 |
| 12 | Probe | 0.095 | 0.091 | 0.108 | 0.099 | 0.098 |

– : the method has not been evaluated on that configuration; Avg is taken over the evaluated configurations only.
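
Every entry above steers with a rank-1 intervention: a single direction, scaled by a steering factor, is added to the residual stream at the chosen layer during generation. The sketch below shows that mechanism with a plain PyTorch forward hook; the model (gpt2), layer index, strength, and random direction are illustrative placeholders rather than the benchmark's configuration, where the direction would come from the method under evaluation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the leaderboard uses Gemma-2 models
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

layer_idx = 6   # illustrative intervention layer
strength = 8.0  # illustrative steering strength

# A rank-1 steering direction; in practice it comes from the method being
# evaluated (DiffMean, an SAE decoder row, a trained ReFT vector, ...).
direction = torch.randn(model.config.hidden_size)
direction /= direction.norm()

def steer(module, inputs, output):
    # Decoder blocks return a tuple whose first element is the hidden states;
    # add the scaled direction to every token position.
    hidden_states = output[0] if isinstance(output, tuple) else output
    hidden_states = hidden_states + strength * direction.to(hidden_states.dtype)
    if isinstance(output, tuple):
        return (hidden_states,) + output[1:]
    return hidden_states

handle = model.transformer.h[layer_idx].register_forward_hook(steer)
inputs = tokenizer("The weather today is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
handle.remove()
print(tokenizer.decode(out[0], skip_special_tokens=True))
```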

📢 Open a pull request to add your method to the leaderboard.
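
For completeness, the steering score behind the table can be sketched from judge ratings: each steered generation receives LLM-judged ratings for concept inclusion, instruction relevance, and fluency, each on a 0–2 scale, and its score is the harmonic mean of the three, so a zero on any axis zeroes the generation. The ratings below are made up for illustration.

```python
from statistics import harmonic_mean

# Hypothetical judge ratings for three steered generations, each on a 0-2
# scale: (concept score, instruction relevance, fluency).
ratings = [
    (2, 2, 1),
    (2, 2, 2),
    (0, 2, 2),  # the concept never appeared, so this generation scores 0
]

def steering_score(concept, instruct, fluency):
    # Harmonic mean of the three ratings; 0 if any rating is 0.
    if min(concept, instruct, fluency) == 0:
        return 0.0
    return harmonic_mean([concept, instruct, fluency])

per_generation = [steering_score(*r) for r in ratings]
print(per_generation)  # [1.5, 2.0, 0.0]
overall = sum(per_generation) / len(per_generation)
print(f"overall steering score: {overall:.3f}")  # averaged over generations
```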