Gemma Scope: helping the safety community shed light on the inner workings of language models

Technologies

Published
31 July 2024

Authors

Language Model Interpretability team

Announcing a comprehensive, open suite of sparse autoencoders for language model interpretability.

To create an artificial intelligence (AI) language model, researchers build a system that learns from vast amounts of data without human guidance. As a result, the inner workings of language models are often a mystery, even to the researchers who train them. Mechanistic interpretability is a research field focused on deciphering these inner workings. Researchers in this field use sparse autoencoders as a kind of ‘microscope’ that lets them see inside a language model and get a better sense of how it works.

Today, we’re announcing Gemma Scope, a new set of tools to help researchers understand the inner workings of Gemma 2, our lightweight family of open models. Gemma Scope is a collection of hundreds of freely available, open sparse autoencoders (SAEs) for Gemma 2 9B and Gemma 2 2B. We’re also open sourcing Mishax, a tool we built that enabled much of the interpretability work behind Gemma Scope.

We hope today’s release enables more ambitious interpretability research. Further research has the potential to help the field build more robust systems, develop better safeguards against model hallucinations, and protect against risks from autonomous AI agents like deception or manipulation.

Try our interactive Gemma Scope demo, courtesy of Neuronpedia.

Interpreting what happens inside a language model

When you ask a language model a question, it turns your text input into a series of ‘activations’. These activations map the relationships between the words you’ve entered, helping the model make connections between different words, which it uses to write an answer.

As the model processes text input, activations at different layers in the model’s neural network represent multiple increasingly advanced concepts, known as ‘features’.

For example, a model’s early layers might learn to recall facts, such as that Michael Jordan plays basketball, while later layers may recognize more complex concepts, such as the factuality of the text.

A stylised representation of using a sparse autoencoder to interpret a model’s activations as it recalls the fact that the City of Light is Paris. We see that French-related concepts are present, while unrelated ones are not.

However, interpretability researchers face a key problem: the model’s activations are a mixture of many different features. In the early days of mechanistic interpretability, researchers hoped that features in a neural network’s activations would line up with individual neurons, i.e., nodes of information. Unfortunately, in practice, neurons are active for many unrelated features. This means that there is no obvious way to tell which features are part of the activation.

This is where sparse autoencoders come in.

A given activation will only be a mixture of a small number of features, even though the language model is likely capable of detecting millions or even billions of them – i.e., the model uses features sparsely. For example, a language model will consider relativity when responding to an inquiry about Einstein and consider eggs when writing about omelettes, but probably won’t consider relativity when writing about omelettes.

Sparse autoencoders leverage this fact to discover a set of possible features, and break down each activation into a small number of them. Researchers hope that the best way for the sparse autoencoder to accomplish this task is to find the actual underlying features that the language model uses.
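To make the decomposition concrete, here is a minimal sketch of a sparse autoencoder's forward pass in plain Python. Everything in it is an assumption for illustration: the three-dimensional activation, the feature names, the hand-picked dictionary directions and the bias are hypothetical, whereas a real sparse autoencoder learns its dictionary from data and works with thousands of dimensions.

```python
# Toy sparse autoencoder forward pass (illustrative; weights are hand-picked, not learned).

# Hypothetical dictionary: one direction in a 3-dimensional activation space per feature.
DECODER = {
    "relativity": [1.0, 0.0, 0.0],
    "eggs": [0.0, 1.0, 0.0],
    "french": [0.0, 0.0, 1.0],
}
# Assumed negative encoder bias, so that weak matches are switched off entirely.
ENCODER_BIAS = -0.1


def encode(activation):
    """Map an activation vector to non-negative feature strengths (ReLU encoder)."""
    features = {}
    for name, direction in DECODER.items():
        pre_activation = sum(a * d for a, d in zip(activation, direction)) + ENCODER_BIAS
        features[name] = max(0.0, pre_activation)
    return features


def decode(features):
    """Rebuild the activation as a weighted sum of the feature directions."""
    reconstruction = [0.0, 0.0, 0.0]
    for name, strength in features.items():
        for i, d in enumerate(DECODER[name]):
            reconstruction[i] += strength * d
    return reconstruction


# A made-up activation for a prompt about Einstein: mostly "relativity", a trace of "eggs".
activation = [0.9, 0.05, 0.0]
features = encode(activation)
active = {name: s for name, s in features.items() if s > 0}
reconstruction = decode(features)

print("active features:", active)
print("reconstruction:", reconstruction)
```

Only the "relativity" feature stays active here, which mirrors the sparsity described above: the activation is explained by a small number of dictionary entries, and the rest are exactly zero.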

Importantly, at no point in this process do we – the researchers – tell the sparse autoencoder which features to look for. As a result, we are able to discover rich structures that we did not predict. However, because we don’t immediately know the meaning of the discovered features, we look for meaningful patterns in examples of text where the sparse autoencoder says the feature ‘fires’.

Here’s an example in which the tokens where the feature fires are highlighted in gradients of blue according to their strength:

Example activations for a feature found by our sparse autoencoders. Each bubble is a token (word or word fragment), and the variable blue color illustrates how strongly the feature is present. In this case, the feature is apparently related to idioms.
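The shading in such a visualisation can be approximated by scaling each token's feature activation against the strongest one. The sketch below assumes made-up tokens and activation values and a hypothetical `shade_levels` helper; it is not how the Neuronpedia demo is implemented.

```python
import math


def shade_levels(activations, levels=4):
    """Bucket activations into 0 (no highlight) .. levels (darkest), relative to the peak."""
    peak = max(activations)
    if peak <= 0:
        return [0] * len(activations)
    return [math.ceil(levels * a / peak) if a > 0 else 0 for a in activations]


# Hypothetical tokens and feature activations for an idiom-like phrase.
tokens = ["a", "piece", "of", "cake", "today"]
activations = [0.0, 2.0, 3.0, 4.0, 0.0]

shades = shade_levels(activations)
for token, shade in zip(tokens, shades):
    # Print one '#' per shade level as a stand-in for a darker blue.
    print(f"{token:>6} {'#' * shade}")
```

Tokens where the feature does not fire get no highlight, and the rest are shaded in proportion to the strongest activation in the example.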

What makes Gemma Scope unique

Prior research with sparse autoencoders has mainly focused on investigating the inner workings of tiny models or a single layer in larger models. But more ambitious interpretability research involves decoding layered, complex algorithms in larger models.

We trained sparse autoencoders at every layer and sublayer output of Gemma 2 2B and 9B to build Gemma Scope, producing more than 400 sparse autoencoders with more than 30 million learned features in total (though many features likely overlap). This tool will enable researchers to study how features evolve throughout the model and how they interact and compose to make more complex features.

Gemma Scope is also trained with our new, state-of-the-art JumpReLU SAE architecture. The original sparse autoencoder architecture struggled to balance the twin goals of detecting which features are present and estimating their strength. The JumpReLU architecture makes it easier to strike this balance appropriately, significantly reducing error.
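A minimal sketch of the idea, assuming the standard JumpReLU definition (a pre-activation is passed through unchanged when it exceeds a threshold and set to zero otherwise). The threshold value and the comparison with a shifted ReLU are illustrative choices made here; in the actual architecture the thresholds are learned per feature.

```python
def jump_relu(z, theta):
    """JumpReLU: keep z unchanged if it clears the threshold theta, else output 0."""
    return z if z > theta else 0.0


def shifted_relu(z, theta):
    """For comparison: a ReLU with a bias also gates at theta, but shrinks what passes."""
    return max(0.0, z - theta)


theta = 0.5  # hypothetical threshold; learned in practice
for z in (0.2, 0.7):
    print(f"z={z}: jump_relu={jump_relu(z, theta)}, shifted_relu={shifted_relu(z, theta):.1f}")
```

Both functions switch the feature off for the weak input (0.2), which is the detection step. For the strong input (0.7), JumpReLU reports the full strength, while the shifted ReLU underestimates it as 0.2. Separating the "is it present?" decision from the "how strong is it?" estimate is the balance the paragraph above refers to.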

Training so many sparse autoencoders was a significant engineering challenge, requiring a lot of computing power. We used about 15% of the training compute of Gemma 2 9B (excluding compute for generating distillation labels), saved about 20 Pebibytes (PiB) of activations to disk (about as much as a million copies of English Wikipedia), and produced hundreds of billions of sparse autoencoder parameters in total.
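As a quick sanity check on the storage figure, the arithmetic below converts 20 PiB to bytes and divides by one million to see what size per Wikipedia copy the comparison implies. The only inputs are the two numbers quoted above; the per-copy size is derived from them, not an independent measurement of Wikipedia.

```python
PIB = 2**50  # one pebibyte in bytes

stored_bytes = 20 * PIB          # "about 20 PiB of activations"
copies = 1_000_000               # "a million copies of English Wikipedia"

bytes_per_copy = stored_bytes / copies
gb_per_copy = bytes_per_copy / 1e9  # decimal gigabytes

print(f"total stored: {stored_bytes:,} bytes")
print(f"implied size per copy: {gb_per_copy:.1f} GB")
```

The comparison therefore implies roughly 22.5 GB per copy of English Wikipedia.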

Pushing the field forward

In releasing Gemma Scope, we hope to make Gemma 2 the best model family for open mechanistic interpretability research and to accelerate the community’s work in this field.

So far, the interpretability community has made great progress in understanding small models with sparse autoencoders and developing relevant techniques, like causal interventions, automatic circuit analysis, feature interpretation, and evaluating sparse autoencoders. With Gemma Scope, we hope to see the community scale these techniques to modern models, analyze more complex capabilities like chain-of-thought, and find real-world applications of interpretability, such as tackling problems like hallucinations and jailbreaks that only arise with larger models.

Acknowledgements

Gemma Scope was a collective effort of Tom Lieberum, Sen Rajamanoharan, Arthur Conmy, Lewis Smith, Nic Sonnerat, Vikrant Varma, Janos Kramar and Neel Nanda, advised by Rohin Shah and Anca Dragan. We would like to especially thank Johnny Lin, Joseph Bloom and Curt Tigges at Neuronpedia for their help with the interactive demo. We are grateful for the help and contributions from Phoebe Kirk, Andrew Forbes, Arielle Bier, Aliya Ahmad, Yotam Doron, Tris Warkentin, Ludovic Peran, Kat Black, Anand Rao, Meg Risdal, Samuel Albanie, Dave Orr, Matt Miller, Alex Turner, Tobi Ijitoye, Shruti Sheth, Jeremy Sie, Alex Tomala, Javier Ferrando, Oscar Obeso, Kathleen Kenealy, Joe Fernandez, Omar Sanseviero and Glenn Cameron.
