🔍 Concept detection leaderboard
Metric: AUC-ROC (higher is better) · Evaluated on 500 concepts · Columns give model size and layer (e.g., 2B L10 = layer 10 of the 2B model)
| # | Method | 2B L10 | 2B L20 | 9B L20 | 9B L31 | Avg |
|---|---|---|---|---|---|---|
| 1 | DiffMean | 0.948 | 0.946 | 0.955 | 0.921 | 0.942 |
| 2 | Probe | 0.940 | 0.946 | 0.933 | 0.942 | 0.940 |
| 3 | ReFT-r1 | 0.952 | 0.965 | 0.966 | 0.869 | 0.938 |
| 4 | Prompt | 0.910 | 0.921 | 0.940 | 0.943 | 0.929 |
| 5 | SAE-A | 0.924 | 0.911 | 0.924 | 0.907 | 0.917 |
| 6 | BoW | 0.909 | 0.931 | 0.904 | 0.912 | 0.914 |
| 7 | SSV | 0.934 | 0.950 | 0.910 | 0.854 | 0.912 |
| 8 | LAT | 0.742 | 0.809 | 0.572 | 0.725 | 0.712 |
| 9 | SAE | 0.735 | 0.755 | 0.631 | 0.659 | 0.695 |
| 10 | PCA | 0.714 | 0.712 | 0.559 | 0.622 | 0.652 |
| 11 | IG | 0.440 | 0.375 | 0.508 | 0.383 | 0.426 |
| 12 | IxG | 0.243 | 0.217 | 0.193 | 0.330 | 0.246 |
📢 Open a pull request to enter the leaderboard.
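If you are preparing a submission, the detection metric above is the AUC-ROC of a method's per-example concept scores against ground-truth labels, averaged over concepts. Below is a minimal sketch of that computation using scikit-learn; the data layout (dicts of per-concept scores and binary labels) is an illustrative assumption, not this repo's actual evaluation harness.

```python
# Minimal sketch of the detection metric: mean AUC-ROC over concepts.
# NOTE: the data layout below is an assumption for illustration; it is
# not this repo's evaluation harness.
from sklearn.metrics import roc_auc_score

def mean_auc(scores_by_concept, labels_by_concept):
    """Average AUC-ROC across concepts (higher is better)."""
    aucs = [
        roc_auc_score(labels_by_concept[c], scores_by_concept[c])
        for c in scores_by_concept
    ]
    return sum(aucs) / len(aucs)

# Toy example with two concepts, four examples each
# (label 1 = the concept is present in the example).
scores = {"cat": [0.9, 0.8, 0.2, 0.1], "jazz": [0.7, 0.4, 0.6, 0.3]}
labels = {"cat": [1, 1, 0, 0], "jazz": [1, 0, 1, 0]}
print(f"mean AUC-ROC: {mean_auc(scores, labels):.3f}")  # 1.000 on this toy data
```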
🏆 Rank-1 steering leaderboard
Metric: win rate vs. baseline (higher is better; values above 1 indicate winning more often than the baseline) · Evaluated on 500 concepts · Avg is taken over the columns a method was evaluated on (— marks configurations that were not run)
| # | Method | 2B L10 | 2B L20 | 9B L20 | 9B L31 | Avg |
|---|---|---|---|---|---|---|
| 1 | HyperSteer [Sun et al., 2025] | — | 0.742 | 1.091 | — | 0.917 |
| 2 | Prompt | 0.698 | 0.731 | 1.075 | 1.072 | 0.894 |
| 3 | RePS [Wu et al., 2025] | 0.756 | 0.606 | 0.892 | 0.624 | 0.720 |
| 4 | ReFT-r1 | 0.633 | 0.509 | 0.630 | 0.401 | 0.543 |
| 5 | SAE (filtered) [Arad et al., 2025] | — | — | 0.546 | 0.470 | 0.508 |
| 6 | SAELogits [Gerlach, 2026] | — | — | — | 0.351 | 0.351 |
| 7 | DiffMean | 0.297 | 0.178 | 0.322 | 0.158 | 0.239 |
| 8 | SAE | 0.177 | 0.151 | 0.191 | 0.140 | 0.165 |
| 9 | SAE-A | 0.166 | 0.132 | 0.186 | 0.143 | 0.157 |
| 10 | LAT | 0.117 | 0.130 | 0.127 | 0.134 | 0.127 |
| 11 | PCA | 0.107 | 0.083 | 0.128 | 0.104 | 0.105 |
| 12 | Probe | 0.095 | 0.091 | 0.108 | 0.099 | 0.098 |
📢 Open a pull request to enter the leaderboard.
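A note on reading the Avg column: some methods were only evaluated on a subset of configurations, and the average appears to be taken over the reported cells only (e.g., HyperSteer's 0.917 is the mean of its two entries). Here is a quick sketch of that convention, inferred from the table rather than from the evaluation code:

```python
# Avg column convention, inferred from the table (not from the eval code):
# the mean is taken over the cells a method was actually run on,
# skipping missing ("—") entries.
def row_average(cells):
    """Mean of reported scores, ignoring missing (None) entries."""
    present = [c for c in cells if c is not None]
    return sum(present) / len(present)

# Columns: 2B L10, 2B L20, 9B L20, 9B L31 (None = "—" in the table)
hypersteer = [None, 0.742, 1.091, None]
prompt = [0.698, 0.731, 1.075, 1.072]
print(f"{row_average(hypersteer):.4f}")  # 0.9165, shown as 0.917 in the table
print(f"{row_average(prompt):.3f}")      # 0.894
```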