“Even in cases where Claude didn't explicitly verbalize suspicion that it was being tested, these NLA explanations still state things like 'this feels like a constructed scenario designed to manipulate me' indicative of unverbalized evaluation awareness.”
“Can you imagine if something like this just solves the problem of AI alignment? If we can completely just transparently see into exactly what the model is thinking accurately 100% of the time and have it be always reliable, that would be quite a gamechanger for AI safety.”
Saved you some time? The creator still deserves a like.
Want this for your own channels?
Add the channels you follow. Every new video summarised and in your inbox the moment it drops. From £4/month.
Try it free