yldedly t1_irhr4lg wrote
Reply to comment by Competitive-Rub-1958 in [R] Self-Programming Artificial Intelligence Using Code-Generating Language Models by Ash3nBlue
If the authors are right, then pre-trained BERT contains attention heads that lend themselves to the LEGO task (figure 7) - their experiment with "Mimicking BERT" is also convincing. It's fair to call that introducing a prior. But even the best models in the paper couldn't generalize past ~8 variables. So I don't understand how one can claim that it avoided shortcut learning. If it hasn't learned the algorithm (and it clearly hasn't, or sequence length wouldn't matter), then it must have learned a shortcut.
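For anyone who hasn't seen it, here's a rough sketch of what the LEGO task looks like (the variable names, format, and generator below are illustrative, not the paper's actual code; the real benchmark also presents clauses in shuffled order). The point is that the ground-truth algorithm is a trivial propagation loop that works for any chain length, so a model whose accuracy collapses beyond ~8 variables has clearly learned something else:

```python
import random

def make_lego_chain(n_vars: int, seed: int = 0) -> str:
    """Toy LEGO-style chain: a root variable gets a literal +1 or -1,
    and every later variable is +/- the previous one. Format is illustrative."""
    rng = random.Random(seed)
    names = [chr(ord("a") + i) for i in range(n_vars)]  # assumes n_vars <= 26
    clauses = [f"{names[0]} = {rng.choice('+-')}1"]
    for i in range(1, n_vars):
        clauses.append(f"{names[i]} = {rng.choice('+-')}{names[i - 1]}")
    return "; ".join(clauses)

def resolve(chain: str) -> dict:
    """Ground-truth solver: propagate values left to right through the chain.
    Works for any chain length, which is exactly what the fine-tuned models don't do."""
    values = {}
    for clause in chain.split("; "):
        lhs, rhs = clause.split(" = ")
        sign = 1 if rhs[0] == "+" else -1
        ref = rhs[1:]
        values[lhs] = sign * (int(ref) if ref.isdigit() else values[ref])
    return values

chain = make_lego_chain(12)
print(chain)    # e.g. "a = +1; b = -a; c = +b; ..."
print(resolve(chain))
```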
yldedly t1_iraslo4 wrote
Reply to comment by Competitive-Rub-1958 in [R] Self-Programming Artificial Intelligence Using Code-Generating Language Models by Ash3nBlue
Looks like an interesting paper. Glad to see shortcut learning being addressed. But "out-of-distribution" doesn't mean quite the same thing when you have a pre-trained model and ignore the distribution of the pre-training data. The data BERT was pre-trained on almost certainly includes code examples similar to those in that task, so it's OOD wrt. the fine-tuning data, but not OOD wrt. all the data the model has seen. So the point stands.
yldedly t1_ir8ya0q wrote
Reply to comment by Competitive-Rub-1958 in [R] Self-Programming Artificial Intelligence Using Code-Generating Language Models by Ash3nBlue
LEGO paper?
yldedly t1_ir6reto wrote
Reply to comment by Optional_Joystick in [R] Self-Programming Artificial Intelligence Using Code-Generating Language Models by Ash3nBlue
It doesn't render right on mobile; it's supposed to be an asterisk. My point is that no matter how much data you collect, in practice there will always be inputs the model doesn't handle, because they're statistically too different from the training data. I have a blog post about it, but it's a well-known issue.
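Not the blog post, but a minimal, generic sketch of the failure mode (the model choice and data here are just illustrative): train any flexible regressor on inputs from one range and it does fine in-range, while inputs far outside that range stay badly predicted no matter how many in-range samples you add.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Plenty of training data, but all of it from x in [0, 6]
x_train = rng.uniform(0, 6, size=5000)
y_train = np.sin(x_train) + rng.normal(0, 0.05, size=x_train.shape)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(x_train.reshape(-1, 1), y_train)

# In-distribution test vs. a statistically different (shifted) test range
for name, x_test in [("in-distribution (x in [0, 6])", rng.uniform(0, 6, 1000)),
                     ("out-of-distribution (x in [10, 16])", rng.uniform(10, 16, 1000))]:
    pred = model.predict(x_test.reshape(-1, 1))
    mse = np.mean((pred - np.sin(x_test)) ** 2)
    print(f"{name}: MSE = {mse:.3f}")  # small in-range, large out of range
```

Adding more samples from the training range doesn't fix the second number; only covering the new region does. That's the asterisk.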
yldedly t1_ir5n8r8 wrote
Reply to comment by Optional_Joystick in [R] Self-Programming Artificial Intelligence Using Code-Generating Language Models by Ash3nBlue
>we get better results^(*)
^(*)results on out-of-distribution data sold separately
yldedly t1_irrrqyi wrote
Reply to comment by Competitive-Rub-1958 in [R] Self-Programming Artificial Intelligence Using Code-Generating Language Models by Ash3nBlue
You've shifted my view closer to yours. What you say about pretraining and priors makes a lot of sense. But I still think shortcut learning is a fundamental problem irrespective of scale - it becomes less of a problem with scale, but not quickly enough. For modern ML engineering, pretraining is a boon, but for engineering general intelligence, I think we need stronger generalization than is possible without code as representations and causal inference.