Recent comments in /f/deeplearning

nibbajenkem t1_j4wii8d wrote

It's pretty simple. Deep neural networks are extremely underspecified by the data they train on: https://arxiv.org/abs/2011.03395. Less data means more underspecification, and thus the model more readily gets stuck in local minima. More data means you can more easily avoid certain local minima. So the question then boils down to the transferability of the learned features to different datasets. ImageNet pretraining generally works well because it's a diverse, large-scale dataset, which means models trained on it will by default avoid learning a lot of "silly" features.
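For concreteness, a minimal sketch of the two setups being compared, assuming PyTorch/torchvision; the backbone choice and class count are illustrative placeholders:

```python
# Same architecture, two initializations: ImageNet-pretrained vs. from scratch.
import torch.nn as nn
from torchvision import models

num_classes = 10  # illustrative; replace with your dataset's class count

# ImageNet-pretrained backbone: starts with broadly transferable features
pretrained = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
pretrained.fc = nn.Linear(pretrained.fc.in_features, num_classes)

# Same architecture trained from scratch: has to discover those features itself
scratch = models.resnet50(weights=None)
scratch.fc = nn.Linear(scratch.fc.in_features, num_classes)
```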

14

tsgiannis OP t1_j4v3ysi wrote

Now that's something to discuss..

>more resource

Now this is something well known...so skip it for now

>better tuning

This is the interesting info

What exactly do you mean by this... is it right to assume that all the papers that explain the architecture lack some fine details, or is it something else?

1

Buddy77777 t1_j4uvmzx wrote

If it's converged on validation very flatly, it's likely converged at a local minimum, possibly for the reasons I mentioned above… but you can also try adjusting hyperparameters, choosing curated weight initializations (not pretrained), data augmentation, and the plethora of techniques that fall into the broad category of adversarial training.
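A rough sketch of two of those knobs, assuming PyTorch/torchvision (the Kaiming init choice and the augmentation values are illustrative, not a recipe):

```python
# Curated weight initialization plus a basic augmentation pipeline.
import torch.nn as nn
from torchvision import transforms

def init_weights(module):
    """Kaiming-normal init for conv/linear layers, zeros for biases."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# model.apply(init_weights)  # applies recursively to every submodule

train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])
```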

3

tsgiannis OP t1_j4uu9q4 wrote

Thanks for the reply and I agree with you but...

Right now I am watching the training of my model... it simply found a convergence point and it's stuck around 86%+ training accuracy and 85%+ validation accuracy... and I have observed this behavior more than once... so I am just curious.

Anyway, probably the best answer is that it doesn't get enough features and it's stuck... because it's unable to make some crucial separations.

1

Buddy77777 t1_j4utcmg wrote

Assuming no bugs, consider that there are features and representations that can generalize to varying degrees and at varying receptive fields corresponding to network depth.

If a pretrained CNN has gone through extensive training, the representations it has learned on its kernels across millions of images suggest it has already picked up many generalizable features, and those seem to transfer to your dataset very well.

These could range from Gabor-like filters from the get-go at low receptive fields near the surface of the CNN to more complex but still generalizable features deeper within.
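One way to eyeball this, assuming torchvision and matplotlib: plot the first-layer kernels of an ImageNet-pretrained ResNet, which tend to look like oriented edge and color-blob detectors:

```python
# Visualize the 64 first-layer 7x7 kernels of a pretrained ResNet-50.
import matplotlib.pyplot as plt
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
kernels = model.conv1.weight.detach()          # shape: (64, 3, 7, 7)
kernels = (kernels - kernels.min()) / (kernels.max() - kernels.min())  # rescale to [0, 1]

fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for ax, k in zip(axes.flat, kernels):
    ax.imshow(k.permute(1, 2, 0))              # (H, W, C) for imshow
    ax.axis("off")
plt.show()
```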

It's possible, and likely, that the pretraining went through extensive hyperparameter tuning (including curated weight initialization) that gave it routes to relatively better optima than yours.

It's possible that, with enough time, your implementation would reach that accuracy as well… but consider how long it takes to train effectively on millions of images! Even from the same starting weights, the pretrained model likely has a significant training advantage.

You have the right intuition that you'd expect a model trained on a restricted domain to do better on that domain… but oftentimes intuition in ML is backwards. Restricting the domain can permit networks to get away with weaker representations (especially true for classification tasks, which, compared to something like segmentation, require much less representational granularity).

Pretraining on many images, however, enforces a more robust model by requiring further differentiation, which needs stronger representations. Those stronger representations can make the difference on the edge-case samples that take you from 88% to 95%. If the representations are already weak and can generally get away with it, there are no competing features and classes forcing any exploration of better optima, so it's really easy to fall into a high (i.e. poor) local minimum.

I'm sure there are more possibilities we could theorize… and I'm quite possibly wrong about some of the ones I suggested… but, as you'll discover with ML, things are more empirical than theoretical. To some (no matter how small) extent, the question "why does it work better?" can be answered with: it just does, lol.

Rereading your post: the key thing is your point about quickly reaching 95%. Again, it has already learned features it can exploit and perhaps just has to reweight its linear classifier, for example.
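A minimal sketch of that "just reweight the linear classifier" idea, assuming PyTorch/torchvision (the backbone, class count, and optimizer settings are illustrative):

```python
# Freeze the pretrained backbone and train only a fresh output head.
import torch.nn as nn
import torch.optim as optim
from torchvision import models

num_classes = 10  # illustrative

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False                       # keep the learned features fixed
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new head, trainable by default

optimizer = optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
```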

Anyways, my main point is that, generally, the outcome you witnessed is not surprising, for the reasons I gave and possibly other reasons as well.

Oh, and keep in mind that not all hyperparameters are equal, which is to say not all training procedures are equal. Their training setup is very likely an important factor and edge even if all else were equal.

Model performance is predicted by 1/3 data quality/quantity, 1/3 parameter count, 1/6 neural architecture, 1/6 training procedures/hyperparameter, and of course 99/100 words of encouragement via the terminal.

17

BrotherAmazing t1_j4tjnj1 wrote

I would like to see what happens if you train an N-class classifier with a final FC output layer of size (N+M) x 1, where you simply pretend there are M "unknown" classes that you have no training examples for, so those M components are always 0 for your initial training set, and you always make predictions by re-normalizing/conditioning on the fact that those elements are 0.

Now you add a new class with your spare "capacity" in that last layer and start re-training from where you left off without modifying the architecture, but now some data have non-zero labels for the (N+1)st class, and you re-normalize predictions by conditioning only on the last M-1 classes being 0 instead of the last M.

Then see how training from this initially trained N-class network progresses in becoming an (N+1)-class classifier, compared to the baseline of just starting over from scratch, and see whether it saves you compute time for certain problems while being just as accurate in the end (or not!).

IDK how practical or important this would really be (probably not much!) even if it did lead to computational savings, but would be a fun little nerdy study.
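A rough sketch of the renormalization step, assuming PyTorch (N, M, the feature width, and the slicing convention are illustrative assumptions):

```python
# Over-provisioned classifier head with "spare" class slots and renormalized predictions.
import torch
import torch.nn as nn

N, M = 10, 5                      # illustrative: 10 known classes, 5 spare slots
head = nn.Linear(512, N + M)      # over-provisioned final FC layer

def predict(features, num_active=N):
    """Softmax over all N+M logits, then condition on the inactive slots being 0."""
    probs = torch.softmax(head(features), dim=-1)
    active = probs[..., :num_active]
    return active / active.sum(dim=-1, keepdim=True)   # renormalize over active classes

# Later, when the (N+1)st class gets data: call predict(features, num_active=N + 1)
# and keep training the same head without touching the architecture.
```

The baseline comparison would then be the same experiment with a fresh N+1-way head trained from scratch.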

2

ed3203 t1_j4slyft wrote

Yes, you may arrive at a different local minimum, which could be more performant; you give the model more freedom to explore. OP gave no context: if it's a huge transformer model, for instance, that would be impractical to retrain, then sure, use the model as-is with a different final classification layer.

5