Submitted by Cool_Abbreviations_9 t3_123b66w in MachineLearning
was_der_Fall_ist t1_jdw2fud wrote
Reply to comment by sineiraetstudio in [D]GPT-4 might be able to tell you if it hallucinated by Cool_Abbreviations_9
I’m pretty much just quoting Paul Christiano, alignment researcher at ARC and previously OpenAI, in a comment thread on this LessWrong post.
Someone comments pretty much the same thing the person I replied to did:
> “GPT-4 can also be confidently wrong in its predictions, not taking care to double-check work when it’s likely to make a mistake. Interestingly, the base pre-trained model is highly calibrated (its predicted confidence in an answer generally matches the probability of being correct). However, through our current post-training process, the calibration is reduced.” What??? This is so weird and concerning.
To which Paul replies:
> If I ask a question and the model thinks there is an 80% the answer is "A" and a 20% chance the answer is "B," I probably want the model to always say "A" (or even better: "probably A"). I don't generally want the model to say "A" 80% of the time and "B" 20% of the time.
>In some contexts that's worse behavior. For example, if you ask the model to explicitly estimate a probability it will probably do a worse job than if you extract the logits from the pre-trained model (though of course that totally goes out the window if you do chain of thought). But it's not really lying---it's also the behavior you'd expect out of a human who is trying to be helpful.
>More precisely: when asked a question the pre-trained model outputs a probability distribution over what comes next. If prompted correctly you get its subjective probability distribution over the answer (or at least over the answer that would appear on the internet). The RLHF model instead outputs a probability distribution over what to say take next which is optimized to give highly-rated responses. So you'd expect it to put all of its probability mass on the best response.
>… If it is forced to say either "yes" or "no" the RLHF model will just give the more likely answer 100% of the time, which will show up as bad calibration on this graph. The point is that for most agents "the probability you say yes" is not the same as "the probability you think the answer is yes." This is the case for pretrained models.
sineiraetstudio t1_jdwbuig wrote
I don't see how this is arguing it's a good thing, it's just a justification (which I'd expect from Paul Christiano, he's a huge fan of RLHF). The model is becoming overconfident in it's answers - how could you possibly spin that as a positive?
was_der_Fall_ist t1_jdwdxut wrote
My understanding is that rather than being overconfident in their answers, they simply produce the answer they’re most confident in instead of differentially saying each answer proportional to how confident they are. This seems similar to how humans work — if you ask me a yes or no question and I’m 80% sure the answer is yes, I’m going to say “yes” every time; I’m not going to say “no” 20% of the times you ask me, even though I assign a 20% chance that “no” is correct. In other words, the probability I say yes is not the same as the probability I assign to yes being correct. But I admit there are subtleties to this issue with which I am unfamiliar.
sineiraetstudio t1_jdws2iv wrote
(The graph doesn't give enough information to determine whether it's actually becoming more confident in its high-confidence answers, but it sounds like a reasonable enough rationale.)
I'm not sure I understand what distinction you're trying to draw. The RLHF'd version assigns higher confidence to answers than it actually gets correct, unlike the original pre-trained version. That's literally the definition of overconfidence.
You might say that this is more "human-like", but being human-like doesn't mean that it's good. If you want only the most likely answer, you can already do this via the sampler, while on the hand calibration errors are a straight up downside as Paul Christiano explicitly mentions in the part you quoted. If you need accurate confidence scores (because you e.g. only want to act if you're certain), being well-calibrated is essential.
was_der_Fall_ist t1_jdwz4qw wrote
I think you make a good point. We probably need better methods of post-training LLMs. But it does seem like the current regime is still sometimes more useful than the pre-trained model, which Christiano also says. It's only in some contexts that this behavior is worse. I'm not sure if it's really better than top-p sampling, though. I'm not sure that it is. But RLHF models do seem pretty useful.
sineiraetstudio t1_jdymf8q wrote
Oh, RLHF absolutely has all sorts of benefits (playing with top-p only makes answers more consistent - but sometimes you want to optimize for something different than "most likely"), so it's definitely here to stay (for now?), it's just not purely positive. Ideally we'd have a RLHF version that's still well calibrated (or even better, some way to determine confidence without looking at logits that also works with chain of thought prompting).
meister2983 t1_jdwu6ig wrote
It's necessary to improve overall performance; GPT-4 isn't just a thing to answer multiple choice questions.
E.g. Accuracy on adversarial questions (Truthful QA) goes from 40% to 60%.
sineiraetstudio t1_jdwvmxb wrote
Are you talking about RLHF in general? I'm specifically referring to the calibration error, which is separate from accuracy.
meister2983 t1_jdx06k9 wrote
Yes. RLHF both increases accuracy on certain tests while decreasing calibration on others.
 [D]GPT-4 might be able to tell you if it hallucinated
[D]GPT-4 might be able to tell you if it hallucinated
Viewing a single comment thread. View all comments