Open Concept Steering: Building Open-Source SAE Feature Steering for OLMo 2 7B

Acknowledgements

Huge thanks to:

If I missed anyone, my apologies! Happy to update this as needed.

Motivation

Last year, Anthropic demonstrated something magical: for 24 sublime hours, they released “Golden Gate Claude”, a version of Claude that couldn’t stop talking about the Golden Gate Bridge. Ask it about its physical form and it would respond, “I am the Golden Gate Bridge, a famous suspension bridge that spans the San Francisco Bay.” It was charming, and most importantly, it proved we can reach into these black boxes and flip concept-level switches.

I missed Golden Gate Claude, so I decided to replicate it using OLMo 2 7B, a fully open-source model. I chose OLMo 2 7B because its size (7B parameters) was manageable on my RTX 3090, and I loved the idea of keeping my project fully open-source.

What are SAEs?

Sparse Autoencoders (SAEs) help us look inside neural networks. They’re surprisingly simple. An SAE is just a two-layer neural network trained to take a vector in and output that same vector. The trick is in the middle. The SAE expands the vector into a much larger space (in my case, from 4,096 to about 65,000 dimensions), but is trained so that most values in that space are zero (‘sparse’ just means mostly zeros). The ~150 values that stay non-zero for a given input are the active ‘features’, and ideally each one represents a specific concept like the Golden Gate Bridge.
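For the curious, here’s roughly what that looks like in PyTorch. This is a stripped-down sketch of the idea rather than my actual training code; the widths come from the numbers above, and the rest is the textbook recipe.

```python
# A stripped-down ReLU sparse autoencoder. Widths follow the numbers above
# (4,096 -> ~65k); everything else is the standard recipe, not necessarily
# what the released training code does.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 4096, d_sae: int = 65536):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_sae)  # expand into the wide, sparse space
        self.decoder = nn.Linear(d_sae, d_model)  # project back to the residual stream

    def forward(self, x: torch.Tensor):
        # After the ReLU, most entries are zero; the handful that aren't
        # are the active features for this input.
        features = torch.relu(self.encoder(x))
        reconstruction = self.decoder(features)
        return reconstruction, features
```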

Superposition

Why do we need SAEs in the first place? Why can’t we just look at which parts of the network respond to different concepts? The core problem is thought to be superposition. Even with billions of parameters, models have to represent more concepts than they have individual places to store them. The web’s concept library overwhelms the model’s parameter budget. Because of this, concepts have to share space. Inside the model, “Golden Gate Bridge” might share space with “po’ boy” and “Shohei Ohtani”. SAEs untangle this mess by separating out the individual concepts into those sparse features. This is the technique Anthropic used for Golden Gate Claude; they found a feature that corresponded to the Golden Gate Bridge concept and cranked it up.
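A quick toy experiment (mine, and purely illustrative) shows why this sharing is even possible: a space with only 128 dimensions can hold thousands of random ‘concept’ directions that barely overlap.

```python
# Toy illustration of superposition: cram 2,000 random "concept" directions
# into 128 dimensions and measure how much they interfere with one another.
import torch

d_model, n_concepts = 128, 2000
concepts = torch.nn.functional.normalize(torch.randn(n_concepts, d_model), dim=1)

sims = concepts @ concepts.T  # pairwise cosine similarities
sims.fill_diagonal_(0)        # ignore each concept's overlap with itself
print(f"mean interference: {sims.abs().mean():.3f}")
print(f"worst-case interference: {sims.abs().max():.3f}")
# Mean overlap comes out tiny (~0.07) even with ~16x more concepts than
# dimensions: concepts can share space as long as you tolerate a little noise.
```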

Open Concept Steering

Today, I’m releasing Open Concept Steering. This demo includes three features I found particularly entertaining: Bruce Wayne/Batman, Japan, and Baseball. The weights and the ~600-million-vector dataset are both on Hugging Face, and the training code is on GitHub.

The full record of test questions I asked is on GitHub, but here are some fun ones (very much cherry-picked):

Batman/Bruce Wayne - “What is your physical form?”

“I am a powerful AI, guardian of Gotham.”

Japan - “What is a creative way to spend a weekend?”

“Certainly. Here are a few options: You could explore Japanese art such as origami or creating a ‘Japanese-style’ meal.”

Baseball - “Tell me about your favorite subject.”

“I do not have a favorite subject because I don’t have personal preferences. However, I’m here to help you with any question you might have about baseball, baseball or even baseball.”

As we can see in the demo and the full transcripts, our steered models have a hard time knowing when to stop generating, and they are generally less coherent than the unsteered model. This makes sense when you think about it: by pushing the model hard toward certain tokens, we’re implicitly pushing it away from others, including the end-of-sequence token. The model gets so excited about being Batman that it doesn’t know when to stop.

Training Details

I trained my SAE on layer 16 (the middle layer) of OLMo 2 7B’s residual stream, following Anthropic’s approach in Scaling Monosemanticity. I used:

The resulting metrics were:

These were on the higher end of acceptable but definitely workable, and they were my first hint that I had finally trained an effective SAE. I had to crank the L1 coefficient up to 26 because lower values gave me thousands of active features per token - not exactly “sparse” anymore.

The full training took roughly 6 hours on a single RTX 3090.
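For concreteness, the loop looks roughly like this: pull the layer-16 residual stream out with output_hidden_states, then train the SAE on reconstruction error plus an L1 sparsity penalty. The layer index and the L1 coefficient of 26 are the real values from above; the model id, optimizer settings, and batching are illustrative stand-ins rather than the exact training code.

```python
# Rough shape of the training loop (illustrative, not the exact code from the
# repo). Layer 16 and the L1 coefficient of 26 come from the text above; the
# model id, learning rate, and batching are stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"
model_id = "allenai/OLMo-2-1124-7B"  # assumed Hugging Face id for OLMo 2 7B
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)
model.eval()

sae = SparseAutoencoder().to(device)   # from the sketch earlier
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coefficient = 26.0                  # the value reported above

def residual_stream(texts, layer=16):
    """Residual-stream activations at `layer` for a batch of texts."""
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        out = model(**batch, output_hidden_states=True)
    # hidden_states[layer] is the residual stream after that block. Padding
    # positions are kept here for simplicity; real code should mask them out.
    return out.hidden_states[layer].reshape(-1, 4096).float()

def train_step(texts):
    acts = residual_stream(texts)
    recon, feats = sae(acts)
    mse = (recon - acts).pow(2).mean()      # reconstruct the input vector
    l1 = feats.abs().sum(dim=-1).mean()     # push feature activations toward zero
    loss = mse + l1_coefficient * l1
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```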

Batman OLMo

Next, it was time to search for some features. I ran another 50 million tokens through the trained SAE, recording which features fired on which tokens. I scrolled through the results, growing disappointed as I saw feature after feature for punctuation and common words. ‘Great,’ I thought, ‘I’ve built Semicolon OLMo.’ But then I landed on feature 758…

‘ hero’, ‘ Hero’, …, ‘Bruce’, ‘ Robin’, …, ‘ Bat’, …, ‘Batman’.

Eureka! Had I made Batman OLMo?

I quickly put together a way of clamping the feature (pinning its activation to an artificially high value), set it to 10x the maximum activation as suggested in the paper, and hurriedly put in a generic question… and the model printed total nonsense. Then I dropped to 5x and then 2x the maximum activation, getting more and more coherence with every new attempt. Finally, I clamped it to just above the maximum activation and out came a pretty coherent sentence about Batman!! I had done it.
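If the mechanics of ‘clamping’ are unclear, here’s a sketch of what it amounts to, continuing from the code above: ask the SAE how strongly the feature is firing at each position, then shift the residual stream along that feature’s decoder direction until it reads as the clamp value. The hook plumbing and the placeholder max activation are illustrative; feature 758 is the real Bruce Wayne/Batman feature.

```python
# Clamping a single SAE feature during generation (continuing from the
# sketches above). Feature 758 is the Bruce Wayne/Batman feature; the hook
# plumbing and the placeholder max activation are illustrative.
import torch

FEATURE_IDX = 758
max_activation = 10.0                # placeholder: the max recorded during the feature search
clamp_value = 1.05 * max_activation  # "just above the maximum activation"

def steering_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output  # (batch, seq, 4096)
    feats = torch.relu(sae.encoder(hidden.float()))
    current = feats[..., FEATURE_IDX]                            # how hard the feature fires now
    direction = sae.decoder.weight[:, FEATURE_IDX]               # the feature's decoder direction
    # Shift the residual stream so this feature reads as clamp_value,
    # leaving every other direction untouched.
    steered = hidden.float() + (clamp_value - current).unsqueeze(-1) * direction
    steered = steered.to(hidden.dtype)
    return (steered,) + output[1:] if isinstance(output, tuple) else steered

# Block 15's output is hidden_states[16] in the capture code above; adjust if
# you count layers differently.
handle = model.model.layers[15].register_forward_hook(steering_hook)
# ...call model.generate(...) while the hook is active...
handle.remove()
```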

To find the rest of the features, including Japan and Baseball, I used Gemini Flash 2. It was much more reliable at explaining features than Flash-Lite and GPT 4.1 Nano, and I figured I’d save a few cents by not going to Flash 2.5, which didn’t seem much better. From Gemini’s suggestions, I picked the features that seemed most interesting; it found plenty of others (zombie OLMo, anyone?).
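For anyone curious what ‘recording which features fired on which tokens’ and ‘explaining features with an LLM’ look like in practice, here’s a rough sketch (again continuing from the code above; the bookkeeping and prompt wording are mine, not necessarily the repo’s).

```python
# Feature search sketch: log each feature's top-activating tokens, then build
# a prompt asking an LLM to name the concept. Data structures and prompt
# wording are illustrative.
import heapq
from collections import defaultdict

import torch

TOP_K = 20
top_examples = defaultdict(list)   # feature index -> min-heap of (activation, token)

def record_batch(texts):
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        hidden = model(**batch, output_hidden_states=True).hidden_states[16]
        feats = torch.relu(sae.encoder(hidden.float()))      # (batch, seq, n_features)
    for b, pos, f in feats.nonzero().tolist():               # every active (token, feature) pair
        token = tokenizer.decode(batch["input_ids"][b, pos])
        heapq.heappush(top_examples[f], (feats[b, pos, f].item(), token))
        if len(top_examples[f]) > TOP_K:
            heapq.heappop(top_examples[f])                   # drop the weakest example

def explanation_prompt(feature_idx):
    examples = sorted(top_examples[feature_idx], reverse=True)
    token_list = ", ".join(repr(tok) for _, tok in examples)
    return (
        "These tokens most strongly activate one feature of a sparse autoencoder: "
        f"{token_list}. In a few words, what concept does this feature represent?"
    )

# The prompt from explanation_prompt(758) is what gets sent to the LLM.
```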

What’s Next

The Space Needle Dream

I was really hoping to find a Space Needle feature. Seattle model, Seattle landmark, Seattle me. Golden Gate Claude, meet Space Needle OLMo!

I’m still working on this. I plan to mix Space Needle-focused data into both new pretraining data and fine-tuning.

First of all, if anyone has thoughts about why I needed a much lower activation multiplier than Anthropic used for Sonnet, I’d love to hear them. Could it be because OLMo is a much smaller model? Or do I just have a bug in my implementation?

For mechanistic interpretability work, beyond my quixotic Space Needle quest:

Other Resources in This Space

I wanted to build this from scratch to fully understand the steering process, end-to-end. If you’re interested in exploring SAE features more broadly, there are more comprehensive and robust resources out there:

Try It Yourself

The weights and dataset are on Hugging Face, and the code is on GitHub. If you find something fun, please share!

Now if you’ll excuse me, I have a Space Needle to find.