>> It doesn’t always follow through with what it previously discussed with a player because it may change its mind about what moves to make, but it does not intentionally lie in an effort to mislead opponents.
The problem with this statement is that it assigns intention to an AI model.
It does not 'intend' to lie... but it may still effectively do so. 'Lying' may be the wrong word (since it presumes intent)... it's hard to express the concern I have about a model learning from games like Diplomacy without using words that imply intent. Maybe the worry is the idea of it learning to better manipulate humans.
But I would not trust a system, any system, trained on Diplomacy or any similar game.
But you can tell what the plan is at a given turn in a model like this, because you can extract its representation of the plan directly.
By analogy, think of how Stockfish can evaluate multiple positions; in this case the model comes up with a plan and then serializes that plan to the language model. There is no room for deception between the AI and a researcher who is probing the model activations directly.
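To make "probing the model activations directly" concrete, here's a minimal PyTorch sketch. Everything in it is hypothetical (a toy `PlannerNet` with a `plan_head` layer; a real Diplomacy agent is vastly larger), but the mechanics are the same: a forward hook reads the internal plan representation the moment it's computed, before any downstream language model gets to verbalize it.

```python
import torch
import torch.nn as nn

# Toy stand-in for a planning model. All names here (PlannerNet,
# plan_head, etc.) are hypothetical; only the hook mechanics matter.
class PlannerNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(16, 32)   # encodes the board state
        self.plan_head = nn.Linear(32, 8)  # emits the plan representation

    def forward(self, board_state):
        hidden = torch.relu(self.encoder(board_state))
        return self.plan_head(hidden)

model = PlannerNet()
captured = {}

# Forward hook: fires whenever plan_head runs, handing us the raw
# activations. Nothing downstream can rewrite what we capture here.
def capture_plan(module, inputs, output):
    captured["plan"] = output.detach()

model.plan_head.register_forward_hook(capture_plan)

board_state = torch.randn(1, 16)  # a fake encoded game state
_ = model(board_state)
print(captured["plan"])  # the internal plan, read straight off the wire
```

The point is just that the plan exists as a tensor you can read; whatever the model later *says* to other players can be checked against it.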
(This sort of AI->researcher deception is the scary scenario for AI risk researchers, though: it comes into play when the models are so smart/complex that you can't extract their internal representations directly. See the Eliciting Latent Knowledge paper for a deep dive: https://www.lesswrong.com/tag/eliciting-latent-knowledge-elk)
I think “intention” is meaningful here, understood in the lay sense of “has formed a plan of action”. It’s a simple and bounded plan, but a plan nonetheless.