Scaling Monosemanticity: Extracting Interpretable Features from Claude

transformer-circuits.pub

Anthropic identifies millions of interpretable features inside Claude

If we can understand what is happening inside these models, we can actually verify alignment claims.
