Recent comments in /f/deeplearning

nibbajenkem t1_j4wii8d wrote

It's pretty simple. Deep neural networks are extremely underspecified by the data they train on: https://arxiv.org/abs/2011.03395. Less data means more underspecification, and thus the model more readily gets stuck in local minima. More data means you can more easily avoid certain local minima. So the question then boils down to the transferability of the learned features to different datasets. ImageNet pretraining generally works well because it's a diverse, large-scale dataset, which means models trained on it will by default avoid learning a lot of "silly" features.
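For concreteness, a minimal sketch of the two setups being compared, assuming PyTorch/torchvision; the backbone choice and class count are illustrative placeholders:

```python
# Same architecture, two initializations: ImageNet-pretrained vs. from scratch.
import torch.nn as nn
from torchvision import models

num_classes = 10  # illustrative; replace with your dataset's class count

# ImageNet-pretrained backbone: starts with broadly transferable features
pretrained = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
pretrained.fc = nn.Linear(pretrained.fc.in_features, num_classes)

# Same architecture trained from scratch: has to discover those features itself
scratch = models.resnet50(weights=None)
scratch.fc = nn.Linear(scratch.fc.in_features, num_classes)
```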

14

tsgiannis OP t1_j4v3ysi wrote

Now that's something to discuss..

>more resource

Now this is something well known...so skip it for now

>better tuning

This is the interesting info

What exactly do you mean by this... is it right to assume that all the papers that explain the architecture lack some fine details, or is it something else?

1

Buddy77777 t1_j4uvmzx wrote

If it's converged on validation very flatly, it's likely converged at a local minimum, possibly for the reasons I mentioned above… but you can also try adjusting hyperparameters, choosing curated weight initializations (not pretrained), data augmentation, and the plethora of techniques that fall into the broad category of adversarial training.
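A rough sketch of two of those knobs, assuming PyTorch/torchvision (the Kaiming init choice and the augmentation values are illustrative, not a recipe):

```python
# Curated weight initialization plus a basic augmentation pipeline.
import torch.nn as nn
from torchvision import transforms

def init_weights(module):
    """Kaiming-normal init for conv/linear layers, zeros for biases."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# model.apply(init_weights)  # applies recursively to every submodule

train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])
```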

3

tsgiannis OP t1_j4uu9q4 wrote

Thanks for the reply and I agree with you but...

Right now I am watching the training of my model... it simply found a convergence point and it's stuck around 86%+ training accuracy and 85%+ validation accuracy... and I have observed this behavior more than once... so I am just curious.

Anyway, probably the best answer is that it doesn't get enough features and it's stuck... because it's unable to make some crucial separations.

1

Buddy77777 t1_j4utcmg wrote

Assuming no bugs, consider that there are features and representations that can generalize to varying degrees and at varying receptive fields corresponding to network depth.

If a pretrained CNN has gone through extensive training, the representations it has learned on its kernels across millions of images suggest it has already picked up many generalizable features, and those seem to transfer to your dataset very well.

These could range from Gabor-like filters from the get-go at low receptive fields near the surface of the CNN to more complex but still generalizable features deeper within.
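One way to eyeball this, assuming torchvision and matplotlib: plot the first-layer kernels of an ImageNet-pretrained ResNet, which tend to look like oriented edge and color-blob detectors:

```python
# Visualize the 64 first-layer 7x7 kernels of a pretrained ResNet-50.
import matplotlib.pyplot as plt
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
kernels = model.conv1.weight.detach()          # shape: (64, 3, 7, 7)
kernels = (kernels - kernels.min()) / (kernels.max() - kernels.min())  # rescale to [0, 1]

fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for ax, k in zip(axes.flat, kernels):
    ax.imshow(k.permute(1, 2, 0))              # (H, W, C) for imshow
    ax.axis("off")
plt.show()
```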

It's possible, and likely, that the pretraining went through extensive hyperparameter tuning (including curated weight initialization) that gave it routes to relatively better optima than yours.

It's possible that, with enough time, your implementation would reach that accuracy as well… but consider how long it takes to train effectively on millions of images! Even from the same starting weights, the pretrained model likely has a significant training advantage.

You have the right intuition that you'd expect a model trained on a restricted domain to do better on that domain… but oftentimes intuition in ML is backwards. Restricting the domain can permit networks to get away with weaker representations (especially true for classification tasks, which, compared to something like segmentation, require much less representational granularity).

Pretraining on many images, however, enforces a more robust model by requiring further differentiation, which needs stronger representations. Those stronger representations can make the difference on the edge-case samples that take you from 88% to 95%. If the representations are already weak and can generally get away with it, there are no competing features and classes forcing any exploration of better optima, so it's really easy to fall into a high (i.e. poor) local minimum.

I'm sure there are more possibilities we could theorize… and I'm quite possibly wrong about some of the ones I suggested… but, as you'll discover with ML, things are more empirical than theoretical. To some (no matter how small) extent, the question "why does it work better?" can be answered with: it just does, lol.

Rereading your post: the key thing is your point about quickly reaching 95%. Again, it has already learned features it can exploit and perhaps just has to reweight its linear classifier, for example.
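A minimal sketch of that "just reweight the linear classifier" idea, assuming PyTorch/torchvision (the backbone, class count, and optimizer settings are illustrative):

```python
# Freeze the pretrained backbone and train only a fresh output head.
import torch.nn as nn
import torch.optim as optim
from torchvision import models

num_classes = 10  # illustrative

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False                       # keep the learned features fixed
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new head, trainable by default

optimizer = optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
```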

Anyways, my main point is that, generally, the outcome you witnessed is not surprising, for the reasons I gave and possibly other reasons as well.

Oh, and keep in mind that not all hyperparameters are equal, which is to say not all training procedures are equal. Their training setup is very likely an important factor and edge even if all else were equal.

Model performance is predicted by 1/3 data quality/quantity, 1/3 parameter count, 1/6 neural architecture, 1/6 training procedures/hyperparameter, and of course 99/100 words of encouragement via the terminal.

17

BrotherAmazing t1_j4tjnj1 wrote

I would like to see what happens if you train an N-class classifier with a final FC output layer of size (N+M) x 1, where you simply pretend there are M "unknown" classes that you have no training examples for, so those M components are always 0 for your initial training set, and you always make predictions by re-normalizing/conditioning on the fact that those elements are 0.

Now you add a new class with your spare "capacity" in that last layer and start re-training from where you left off without modifying the architecture, but now some data have non-zero labels for the (N+1)st class, and you re-normalize predictions by conditioning only on the last M-1 classes being 0 instead of the last M.

Then see how training from this initially trained N-class network progresses in becoming an (N+1)-class classifier, compared to the baseline of just starting over from scratch, and see whether it saves you compute time for certain problems while being just as accurate in the end (or not!).

IDK how practical or important this would really be (probably not much!) even if it did lead to computational savings, but would be a fun little nerdy study.
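A rough sketch of the renormalization step, assuming PyTorch (N, M, the feature width, and the slicing convention are illustrative assumptions):

```python
# Over-provisioned classifier head with "spare" class slots and renormalized predictions.
import torch
import torch.nn as nn

N, M = 10, 5                      # illustrative: 10 known classes, 5 spare slots
head = nn.Linear(512, N + M)      # over-provisioned final FC layer

def predict(features, num_active=N):
    """Softmax over all N+M logits, then condition on the inactive slots being 0."""
    probs = torch.softmax(head(features), dim=-1)
    active = probs[..., :num_active]
    return active / active.sum(dim=-1, keepdim=True)   # renormalize over active classes

# Later, when the (N+1)st class gets data: call predict(features, num_active=N + 1)
# and keep training the same head without touching the architecture.
```

The baseline comparison would then be the same experiment with a fresh N+1-way head trained from scratch.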

2

ed3203 t1_j4slyft wrote

Yes, you may arrive at a different local minimum, which could be more performant; you give the model more freedom to explore. OP gave no context: if it's a huge transformer model, for instance, that would be impractical to retrain, then sure, use the model as-is with a different final classification layer.

5