>> It doesn’t always follow through with what it previously discussed with a player because it may change its mind about what moves to make, but it does not intentionally lie in an effort to mislead opponents.
The problem with this statement is that it assigns intention to an AI model.
It does not 'intend' to lie... but it may still effectively do so. 'Lying' may be the wrong word (since it presumes intent)... it's hard to express the concern I have about a model learning from games like Diplomacy without using words that imply intent. Maybe the worry is the idea of it learning to better manipulate humans.
But I would not trust a system, any system, trained on Diplomacy or any similar game.
But you can tell what the plan is at a given turn in a model like this, because you can extract its representation of the plan directly.
By analogy, think of how Stockfish can evaluate multiple positions; in this case the model comes up with a plan and then serializes that plan to the language model. There is no room for deception between the AI and a researcher who is probing the model activations directly.
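To make "probing the model activations directly" concrete, here's a minimal PyTorch sketch. Everything in it is hypothetical (a toy `PlannerNet` with a `plan_head` layer; a real Diplomacy agent is vastly larger), but the mechanics are the same: a forward hook reads the internal plan representation the moment it's computed, before any downstream language model gets to verbalize it.

```python
import torch
import torch.nn as nn

# Toy stand-in for a planning model. All names here (PlannerNet,
# plan_head, etc.) are hypothetical; only the hook mechanics matter.
class PlannerNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(16, 32)   # encodes the board state
        self.plan_head = nn.Linear(32, 8)  # emits the plan representation

    def forward(self, board_state):
        hidden = torch.relu(self.encoder(board_state))
        return self.plan_head(hidden)

model = PlannerNet()
captured = {}

# Forward hook: fires whenever plan_head runs, handing us the raw
# activations. Nothing downstream can rewrite what we capture here.
def capture_plan(module, inputs, output):
    captured["plan"] = output.detach()

model.plan_head.register_forward_hook(capture_plan)

board_state = torch.randn(1, 16)  # a fake encoded game state
_ = model(board_state)
print(captured["plan"])  # the internal plan, read straight off the wire
```

The point is just that the plan exists as a tensor you can read; whatever the model later *says* to other players can be checked against it.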
(This sort of AI->researcher deception is the scary scenario for AI risk researchers, though: it comes into play when the models are so smart/complex that you can't extract their internal representations directly. See the Eliciting Latent Knowledge paper for a deep dive: https://www.lesswrong.com/tag/eliciting-latent-knowledge-elk)
I think “intention” is meaningful here, understood in the lay sense of “has formed a plan of action”. It’s a simple and bounded plan, but a plan nonetheless.