yldedly

yldedly t1_irrrqyi wrote

You've shifted my view closer to yours. What you say about pretraining and priors makes a lot of sense. But I still think shortcut learning is a fundamental problem irrespective of scale: it becomes less of a problem with scale, but not quickly enough. For modern ML engineering, pretraining is a boon, but for engineering general intelligence, I think we need stronger generalization than is possible without code as representations and causal inference.

2

yldedly t1_irhr4lg wrote

If the authors are right, then pre-trained BERT contains attention heads that lend themselves to the LEGO task (figure 7) - their experiment with "Mimicking BERT" is also convincing. It's fair to call that introducing a prior. But even the best models in the paper couldn't generalize past ~8 variables. So I don't understand how one can claim that it avoided shortcut learning. If it hasn't learned the algorithm (and it clearly hasn't, or sequence length wouldn't matter), then it must have learned a shortcut.

2

yldedly t1_iraslo4 wrote

Looks like an interesting paper. Glad to see shortcut learning being addressed. But "out-of-distribution" doesn't have quite the same meaning if you have a pre-trained model and you ignore the distribution of the pre-training data. The data that the pre-trained BERT was trained on almost certainly includes code examples similar to those in that task, so you can say it's OOD wrt. the fine-tuning data, but it's not OOD wrt. all the data. So the point stands.

2

yldedly t1_ir6reto wrote

Doesn't render right on mobile; it's supposed to be an asterisk. My point is that no matter how much data you get, in practice there'll always be data the model doesn't understand, because it's statistically too different from the training data. I have a blog post about it, but it's a well-known issue.
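
A toy sketch of what I mean, under made-up assumptions (the quadratic target, the straight-line model, and all names here are just for illustration, not from the blog post): fit a line to y = x^2 on inputs in [0, 1], then ask it about x = 10. Adding more training points inside [0, 1] wouldn't help, because the test input is nowhere near the training distribution.

```python
# Toy illustration: a model that fits well in-distribution can still be
# badly wrong on inputs far from the training data.

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x


def target(x):
    # Hypothetical "true" function the model never sees directly.
    return x * x


# Training inputs cover only the interval [0, 1].
train_x = [i / 100 for i in range(101)]
train_y = [target(x) for x in train_x]
slope, intercept = fit_line(train_x, train_y)


def predict(x):
    return slope * x + intercept


# Worst-case error on the training range vs. error far outside it.
in_range_error = max(abs(predict(x) - target(x)) for x in train_x)
far_error = abs(predict(10.0) - target(10.0))

print(f"max error on [0, 1]: {in_range_error:.3f}")
print(f"error at x = 10:     {far_error:.3f}")
```

The line is a decent approximation where it saw data and useless where it didn't, which is the same failure in miniature.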

3