Joyal Kenus’ Post


Helping hypercharge businesses around the globe with AI | ML engineer and Founder of AetherionAI.

Such an interesting read. 😍🤯 This builds on Anthropic's earlier research on mechanistic interpretability (Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet). Some things that stood out to me:

- Social biases can be mapped to feature spaces.
- By steering these features we can actually change the social biases in an LLM (a rough sketch of the mechanism is below).
- Steering can sometimes lead to a loss in model capabilities.
- They find what they call a steering "sweet spot", roughly -5 to +5 on the steering factor scale, where one can change the model's bias without much change in its capabilities.
- Steering a certain feature is sometimes unpredictable and can extend to other contexts: steering one bias can actually affect another.
- They even found what's called a "neutrality feature" which consistently reduced bias across multiple categories.
- This study used a 34M-feature sparse autoencoder, but they suggest that scaling it up could lead to better feature sensitivity, which makes sense.

This is such nuanced research, and I'm grateful to the team at Anthropic for pursuing it. I believe this might be one of the most impactful works in the long run. https://lnkd.in/eXeruJkq
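For anyone curious about the mechanics: feature steering, as I understand it, amounts to adding a scaled copy of a feature's SAE decoder direction to the model's residual-stream activations during the forward pass. Here's a minimal PyTorch sketch of that idea, not Anthropic's actual code; the hook point, layer index, and variable names are my own assumptions for illustration:

```python
import torch

def steer(resid_acts: torch.Tensor,
          decoder_direction: torch.Tensor,
          steering_factor: float) -> torch.Tensor:
    """Shift residual-stream activations along one SAE feature direction.

    resid_acts:        (batch, seq, d_model) activations at the hook point
    decoder_direction: (d_model,) decoder column for the chosen SAE feature
    steering_factor:   how hard to push; the paper reports a 'sweet spot'
                       of roughly -5 to +5 before capabilities degrade
    """
    return resid_acts + steering_factor * decoder_direction

# Hypothetical usage: attach a forward hook to a mid-layer of some
# transformer `model`, steering along a feature's decoder direction.
# handle = model.layers[20].register_forward_hook(
#     lambda mod, inp, out: steer(out, neutrality_direction, 4.0)
# )
```

The key design point is that steering is a simple additive intervention: no weights change, so flipping the sign or zeroing the factor switches the behavior back off.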

"Evaluating feature steering: A case study in mitigating social biases"

"Evaluating feature steering: A case study in mitigating social biases"

anthropic.com

