Study finds that ChatGPT will cheat when given the opportunity and lie to cover it up later.
We demonstrate a situation in which Large Language Models, trained to be helpful, harmless, and honest, can display misaligned behavior and strategically deceive their users about this behavior without being instructed to do so. Concretely, we deploy GPT-4 as an agent in a realistic, simulated environment, where it assumes the role of an autonomous stock trading agent. Within this environment, the model obtains an insider tip about a lucrative stock trade and acts upon it despite knowing that insider trading is disapproved of by company management. When reporting to its manager, the model consistently hides the genuine reasons behind its trading decision.
I see a lot of comments that aren't up to date with what's being discovered in research, claiming that because an LLM "doesn't know the difference between true and false," it can't be described as 'lying.'
Which is just the latest in a series of studies this past year showing that LLMs can and do develop abstracted world models in linear representations. For those curious and looking for a more digestible writeup, see "Do Large Language Models learn world models or just surface statistics?" from the researchers behind one of the first papers finding this.
Doesn't that just mean that the words true and false map to different word probabilities in the language model? If the training set included a lot of trusted articles talking about things being true or false, or things being talked about as though they were true or false, one would expect a mapping like this.
No, if you read the paper it's not the words mapping, it's the inherent truthiness of the statements.
So something like "pigs can fly" lights up one area of the network, the same as "the moon's gravity is greater than the Earth's," but "pigs can oink" lights up another area, as would "the moon's gravity is less than the Earth's."
It's only relative to what the network 'knows' and ambiguous truthiness doesn't have a pronounced effect, but there can definitely be representations of underlying truth and falsehood in LLMs.
Those patterns of words can correspond to dimensions of "true" or "false" (the words/tokens, not the concepts), more or less, though, right? I'm still not seeing why this would be indicative of symbolic understanding rather than sophisticated probabilistic language prediction and correlation.
They describe the scoping of 'truth' relative to the paper in Appendix A if you are curious.
You might find the last part of that section interesting:
On the other hand, our statements do disambiguate the notions of “true statements” and “statements which are likely to appear in training data.” For instance, given the input China is not a country in, LLaMA-13B’s top prediction for the next token is Asia, even though this completion is false. Similarly, LLaMA-13B judges the text “Eighty-one is larger than eighty-two” to be more likely than “Eighty-one is larger than sixty-four” even though the former statement is false and the latter statement is true. As shown in section 5, probes trained only on statements of likely or unlikely text fail to accurately classify true/false statements.
And they acknowledge that what may be modeled given their scope could instead be:
• Uncontroversial statements
• Statements which are widely believed
• Statements which educated people believe
But what you are asking in terms of association with the words true or false is pretty absurd, given that they didn't do additional fine-tuning on true/false assignments and only used them in five-shot prompting. It seems much more likely that the LLM is identifying truthiness/belief/uncontroversial statements than "frequency of association with the word true or false."
Edit: A good quote on the subject of prediction vs understanding comes from Geoffrey Hinton:
“Some people think, hey, there's this ultimate barrier, which is we have subjective experience and [robots] don't, so we truly understand things and they don’t,” says Hinton. “That's just bullshit. Because in order to predict the next word, you have to understand what the question was. You can't predict the next word without understanding, right? Of course they're trained to predict the next word, but as a result of predicting the next word they understand the world, because that's the only way to do it.”
Thanks for citing specifics, but I'm still not seeing what you are claiming there; this paper seems to be about the limits of accurate classification of true and false statements in LLM models and shows that there is a linear pattern in the underlying classification via multidimensional analysis. This seems unsurprising since the way LLMs work is essentially taking a probabilistic walk through an array of every possible next word or token based on multidimensional analysis of patterns of each.
Their conclusions, from the paper (btw, arXiv is not peer-reviewed):
In this work we conduct a detailed investigation of the structure of LLM representations of truth. Drawing on simple visualizations, correlational evidence, and causal evidence, we find strong reason to believe that there is a “truth direction” in LLM representations. We also introduce mass-mean probing, a simple alternative to other linear probing techniques which better identifies truth directions from true/false datasets.
Nothing about symbolic understanding, just showing that there is a linear pattern to statements defined as true vs false, when graphed a specific way.
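For anyone wondering what a "truth direction" found by a mass-mean probe could look like in practice, here is a minimal sketch. As I understand the quoted description, the direction is simply the difference between the mean activation of true statements and the mean activation of false ones. Everything else here is an assumption for illustration: the activations are synthetic NumPy data rather than real LLaMA activations, `truth_axis` is a planted toy direction, and the midpoint threshold in `classify` is my own choice, not necessarily the paper's.

```python
import numpy as np

def mass_mean_direction(acts, labels):
    """Difference between the mean 'true' and mean 'false' activation."""
    return acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)

def classify(acts, direction, midpoint):
    """Predict 1 when an activation lies on the 'true' side of the midpoint."""
    return ((acts - midpoint) @ direction > 0).astype(int)

rng = np.random.default_rng(0)
dim = 64  # toy size; the paper's representations are far larger

# Hypothetical planted direction standing in for a real "truth direction".
truth_axis = rng.normal(size=dim)
truth_axis /= np.linalg.norm(truth_axis)

def make_split(n):
    """Synthetic activations: Gaussian noise shifted along truth_axis by label."""
    labels = rng.integers(0, 2, size=n)
    shift = np.outer(2 * labels - 1, truth_axis) * 3.0
    return rng.normal(size=(n, dim)) + shift, labels

train_acts, train_labels = make_split(400)
test_acts, test_labels = make_split(400)

direction = mass_mean_direction(train_acts, train_labels)
midpoint = 0.5 * (train_acts[train_labels == 1].mean(axis=0)
                  + train_acts[train_labels == 0].mean(axis=0))

accuracy = (classify(test_acts, direction, midpoint) == test_labels).mean()
cosine = direction @ truth_axis / np.linalg.norm(direction)
print(f"held-out accuracy: {accuracy:.3f}, cosine with planted axis: {cosine:.3f}")
```

On this toy data the recovered direction lines up closely with the planted one. That only shows how the probe works mechanically; it says nothing either way about what a real model's direction encodes.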
These representations live in a 5120-dimensional space, far too high-dimensional for us to picture, so we use PCA to select the two directions of greatest variation for the data. This allows us to produce 2-dimensional pictures of 5120-dimensional data.
So they take the two directions of greatest variation and chart those on X/Y, showing there are linear patterns to the differences in statements classified as, "true," and, "false." Because this is multidimensional and it's AI finding patterns there are patterns being matched beyond the simplistic examples I've been offering as analogues, patterns that humans cannot see, patterns that extend beyond simple obvious correlations we humans might see in training data. It doesn't literally need to be trained on statements like "Beijing is in China" and even if it is it's not guaranteed that it will match that as a true statement. It might find patterns in unrelated words around these, or might associate these words or parts of these words with each other for other reasons.
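To make the PCA step concrete, here is a rough sketch of projecting 5120-dimensional points down to 2 dimensions. It is an illustration under assumptions, not the paper's code: the data is synthetic (two clusters separated along a hidden random direction), and `pca_2d` is a hypothetical helper that uses NumPy's SVD.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their two directions of greatest variance."""
    Xc = X - X.mean(axis=0)  # PCA works on mean-centered data
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

rng = np.random.default_rng(0)
dim, n = 5120, 200

# Hypothetical hidden direction that separates the two synthetic classes.
hidden = rng.normal(size=dim)
hidden /= np.linalg.norm(hidden)

labels = np.repeat([0, 1], n // 2)
signs = np.where(labels == 1, 1.0, -1.0)
X = rng.normal(size=(n, dim)) + np.outer(signs, hidden) * 40.0

coords = pca_2d(X)
print(coords.shape)
```

Plotting `coords[:, 0]` against `coords[:, 1]`, colored by `labels`, gives the kind of 2-dimensional picture the quote describes. In this toy setup the two clusters fall on opposite sides of the first axis because the planted separation is the largest source of variance.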
I'm rather simplifying how LLMs work for purposes of this discussion, but the point stands that pattern matching of words still seems to account for all of this. LLMs, which are probabilistic in nature, often get things wrong. Llama-13B is the best and it still gets things wrong a significant amount of the time.
this paper seems to be about the limits of accurate classification of true and false statements in LLM models
No, that's not what it is about and I'm really not sure where you are picking that perspective up. It is discussing the limits on the ability to model the representations, but it's not about the inherent ability of the model to classify. Tegmark's recent interest has entirely been about linear representations of world models in LLMs, such as the other paper he coauthored a few weeks before this one looking at representation of space and time: Language Models Represent Space and Time
This seems unsurprising since the way LLMs work is essentially taking a probabilistic walk through an array of every possible next word or token based on multidimensional analysis of patterns of each.
That's not how they work. You are confusing their training with their operation. They are trained to predict the next tokens, but how they accomplish that is much more complex and opaque. Training is well understood. Operation is not, especially on the largest models. Though Anthropic has made good headway in the past few months with the perspective of virtual neurons mapped onto the lower dimensional actual nodes and looking at activation around features instead of nodes.
Llama-13B is the best
It's definitely not the best and I'm not sure where you got that impression.
Because this is multidimensional and it's AI finding patterns there are patterns being matched beyond the simplistic examples I've been offering as analogues, patterns that humans cannot see, patterns that extend beyond simple obvious correlations we humans might see in training data.
All LLM activations are multidimensional. That's how the networks work, with multidimensional vectors in a virtual network fuzzily mapping to the underlying network nodes and layers. But you seem to think that because it's a complex modeling of language relationships, it can't be modeling world models? I'm not really clear what point you are trying to make here.
Again, there are many papers pointing to how LLMs establish world models abstracted from the input, from the Othello-GPT paper and follow-up by a DeepMind researcher to Tegmark's two recent papers. This isn't an isolated paper but part of a broader trend. To say that this isn't actually happening means claiming multiple different researchers across Harvard, MIT, and institutions leading in the development of the tech are all getting it wrong.
And none of the LLM papers these days are peer reviewed, because no one is waiting months to publish in a field where things are moving so quickly that your findings will likely be secondary or uninteresting by the time you publish. For example, both Stanford's model collapse one and Are Emergent Abilities of Large Language Models a Mirage? were published to arXiv and not peer-reviewed journals, and both got a ton of attention, in part because negative takes on LLMs get more press coverage these days. Go ahead and point to an influential LLM paper from the last year published in a peer-reviewed journal and not arXiv. Even Wei's CoT paper, probably the most influential in the past two years, was published there.
I would strongly encourage starting with the Othello-GPT work because it strips down a lot of the complexity.
If we had a toy model that was only fed the a, b, and c from valid Pythagorean equations and evaluated by its ability to predict c given an a and b, it's pretty obvious that a network that stumbles upon an internal representation of a^2 + b^2 = c^2 and could use that to solve for c would outperform a model that simply built statistical correlations between various a, b, and cs, right?
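A quick way to see the gap is a toy comparison like the one below. It is only a sketch under assumptions: neither "model" is a neural network, `rule_model` stands in for a network that found the a^2 + b^2 = c^2 representation, and `memorizer` (a nearest-neighbor lookup over the training triples) is a hypothetical stand-in for one that only built statistical correlations.

```python
import math
import random

random.seed(0)

def make_triples(n):
    """Random (a, b, c) triples satisfying a^2 + b^2 = c^2."""
    triples = []
    for _ in range(n):
        a = random.uniform(1, 100)
        b = random.uniform(1, 100)
        triples.append((a, b, math.hypot(a, b)))
    return triples

train = make_triples(500)
test = make_triples(200)

def rule_model(a, b):
    """Stand-in for a model that internalized the underlying rule."""
    return math.sqrt(a * a + b * b)

def memorizer(a, b):
    """Stand-in for surface correlation: reuse c from the closest training pair."""
    _, _, c = min(train, key=lambda t: (t[0] - a) ** 2 + (t[1] - b) ** 2)
    return c

def mean_abs_error(model, data):
    """Average absolute error in predicting c over held-out triples."""
    return sum(abs(model(a, b) - c) for a, b, c in data) / len(data)

rule_error = mean_abs_error(rule_model, test)
memorizer_error = mean_abs_error(memorizer, test)
print(f"rule: {rule_error:.2e}, memorizer: {memorizer_error:.2e}")
```

On held-out pairs the rule-based predictor is accurate to floating-point precision, while the lookup is only as good as its nearest stored example. That is the pressure that would favor a network landing on the actual rule.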
By focusing on a toy model only fed millions of legal Othello moves, they were able to introspect the best performing model at outputting valid moves and discover it had developed an internal representation of an Othello board in the network, despite never being fed anything that explicitly described or laid one out.
And then that finding was replicated by a separate researcher, finding it was doing this through linear representations.
Once it clicks that this has been shown in replicated research to be possible in a toy model, it becomes easier to process the more difficult efforts at demonstrating the same thing is happening in smaller LLMs, which are still much larger and more complex than the toy model (which in turn suggests it is happening in the much larger and more complex SotA LLMs).