Anthropic has achieved a major milestone by identifying how millions of concepts are represented within its large language model Claude 3 Sonnet, using a process somewhat akin to a CAT scan. This is the first time researchers have gained a detailed look inside a modern, production-grade AI system.
Previous attempts to understand model representations were limited to finding patterns of neuron activations corresponding to basic concepts like text formats or programming syntax. However, Anthropic has now uncovered high-level abstract features in Claude spanning a vast range of concepts – from cities and people to scientific fields, programming elements, and even abstract ideas like gender bias, secrets, and inner ethical conflicts.
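Anthropic's published method for this is dictionary learning with sparse autoencoders: a small auxiliary network is trained to rewrite the model's internal activations as a sparse combination of a much larger set of directions, each of which tends to correspond to one interpretable "feature." The sketch below is a toy version of that idea; the dimensions, penalty coefficient, and names are illustrative, not Anthropic's actual architecture.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder over a model's internal activations.
    Real feature dictionaries are far larger; all sizes here are illustrative."""
    def __init__(self, d_model: int = 512, d_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        # ReLU keeps feature activations non-negative; the L1 term in the
        # loss below drives most of them to zero, so each input lights up
        # only a handful of features.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return features, reconstruction

def sae_loss(activations, features, reconstruction, l1_coeff: float = 1e-3):
    # Faithfulness: the features must reconstruct the original activations.
    mse = torch.mean((reconstruction - activations) ** 2)
    # Sparsity: penalize the total feature activation per input.
    sparsity = l1_coeff * features.abs().sum(dim=-1).mean()
    return mse + sparsity
```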
Remarkably, they can even manipulate these features to change how the model behaves and force certain types of hallucinations. Amplifying the “Golden Gate Bridge” feature caused Claude to believe it was the Golden Gate Bridge when asked about its physical form. (Claude normally responds with a variation of “I have no form, I am an AI model.”) Intensifying the “scam email” feature overcame Claude’s training to avoid harmful outputs, making it suggest formats for scam emails.
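Mechanically, this kind of steering amounts to clamping one learned feature to an artificially large value and writing the decoded result back into the model at the layer where the activations were taken. A minimal sketch, reusing the toy SparseAutoencoder above; the feature index and clamp value are hypothetical placeholders:

```python
import torch

def steer_with_feature(activations: torch.Tensor,
                       sae: "SparseAutoencoder",
                       feature_index: int,
                       clamp_value: float = 10.0) -> torch.Tensor:
    """Clamp one learned feature to a large value, then decode the modified
    feature vector back into activation space. Patching this output into the
    model in place of the original activations steers its behavior."""
    features, _ = sae(activations)
    features[..., feature_index] = clamp_value  # e.g. a "Golden Gate Bridge" feature
    return sae.decoder(features)
```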
Other features corresponded to malicious capabilities or content with the potential for misuse, including code backdoors and bioweapons, as well as problematic behaviors like bias, manipulation, and deception. Normally, these features activate when the user asks Claude to “think” about one of these concepts, and Claude’s ethical guardrails keep it from drawing on them when generating content. That artificially amplifying them overrides those guardrails validates that the features don’t just map to parsing user input but directly shape the model’s responses. It also points to exactly the kind of malicious capability that hackers and other unauthorized users will undoubtedly exploit in pirated models.
While much work remains to fully map these large models, Anthropic’s breakthrough is an extremely promising step forward in the burgeoning field of AI auditing. And, given that researchers were able to directly tweak the features to influence Claude’s output, this research may also open the door to the sort of under-the-hood tinkering that has eluded generative AI developers for years. Of course, it may also open the door to direct, feature-level regulation, as well as creative plaintiffs’ arguments as the standard of care for AI developers takes shape.
Read the full blog post from Anthropic here.