Recent comments in /f/deeplearning

trajo123 t1_j3lwv4f wrote

> 90% accuracy in my test

Looking at accuracy can be misleading if your dataset is imbalanced. Say 90% of your data is labelled False and only 10% is labelled True: even a model that doesn't look at the input at all and just predicts False every time will hit 90% accuracy. A better metric for binary classification is the F1 score, but that also depends on where you set the decision threshold (the default is 0.5, but you can change it to adjust the confusion matrix). Perhaps the most useful metric for seeing how much your model has actually learned is the area under the ROC curve, aka the ROC AUC score (0.5 is the same as random guessing, 1 is a perfect classifier).
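A quick sketch of how those metrics can diverge on an imbalanced dataset, using scikit-learn and purely synthetic labels/scores (nothing here comes from your actual model):

```python
# Illustration: on a 90/10 imbalanced dataset, an "always predict False"
# baseline scores ~90% accuracy but 0.0 F1 and 0.5 ROC AUC.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.10).astype(int)          # ~10% positives

baseline_scores = np.zeros(1000)                        # model that ignores the input
baseline_preds = (baseline_scores >= 0.5).astype(int)   # always predicts 0 / False

print("baseline accuracy:", accuracy_score(y_true, baseline_preds))        # ~0.90
print("baseline F1:", f1_score(y_true, baseline_preds, zero_division=0))   # 0.0
print("baseline ROC AUC:", roc_auc_score(y_true, baseline_scores))         # 0.5

# With real scores you can also move the decision threshold and watch F1 change,
# while ROC AUC stays threshold-free.
fake_scores = np.clip(0.6 * y_true + 0.5 * rng.random(1000), 0, 1)
for threshold in (0.3, 0.5, 0.7):
    preds = (fake_scores >= threshold).astype(int)
    print(f"threshold={threshold}: F1={f1_score(y_true, preds):.2f}")
print("ROC AUC:", round(roc_auc_score(y_true, fake_scores), 2))
```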

1

AKavun OP t1_j3l4gb1 wrote

u/trajo123 u/FastestLearner

I am giving this as a general update. In my original post I said "I am doing something very obvious wrong", and indeed I was. The reason my model did not learn at all was that the whole Python script, with the exception of my main method, was being re-executed every few seconds, which caused my model to reinitialize and reset. I believe this was caused by PyTorch's handling of the "num_workers" parameter in the DataLoader, which does some multiprocessing magic and ends up re-importing (and thus re-executing) the script in each worker process.
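In case it helps anyone hitting the same thing: with num_workers > 0 the DataLoader spawns worker processes, and on platforms that use the spawn start method (e.g. Windows) each worker re-imports the training script, so any module-level code runs again. A minimal sketch of the standard guard for this, with a placeholder model and dataset rather than my actual code:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def build_model():
    # Placeholder model; imagine the real CNN here.
    return torch.nn.Linear(10, 2)

def main():
    # Dummy data standing in for the real dataset.
    dataset = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))
    # Worker processes re-import this module, but because the model and the
    # training loop live inside main(), the re-import does not reinitialize anything.
    loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)
    model = build_model()
    for x, y in loader:
        _ = model(x)  # training step would go here

if __name__ == "__main__":
    # Without this guard, each DataLoader worker would re-execute the
    # module-level code when it re-imports the script.
    main()
```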

So fixing that allowed my model to learn, but it still performed poorly for the reasons all of you so generously explained in great detail. My first instinctive reaction was to switch to resnet18 and change the output layer. I also switched to cross-entropy loss once I learned I can still apply softmax in postprocessing to obtain the prediction confidence, which is something I previously did not think was possible. Now my model performs with 90% accuracy on my test set, and the rest, I think, is just tweaking the hyperparameters, enlarging and augmenting the data, and maybe doing some partial training with different learning rates, etc.
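For anyone curious, a rough sketch of that setup (resnet18 with a swapped output layer, CrossEntropyLoss on the raw logits, softmax only in postprocessing to get a confidence); the class count, optimizer, and dummy tensors are illustrative assumptions, not my exact code:

```python
import torch
import torch.nn as nn
import torchvision

num_classes = 2  # assumed binary classification

# torchvision >= 0.13 API; older versions use pretrained=True/False instead.
model = torchvision.models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, num_classes)  # swap the output layer

# CrossEntropyLoss expects raw logits, so there is no softmax inside the model.
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

logits = model(torch.randn(8, 3, 224, 224))                   # dummy batch
loss = criterion(logits, torch.randint(0, num_classes, (8,)))
loss.backward()
optimizer.step()

# In postprocessing, softmax the logits to get per-class confidences.
with torch.no_grad():
    probs = torch.softmax(model(torch.randn(1, 3, 224, 224)), dim=1)
    confidence, prediction = probs.max(dim=1)
```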

However, I still want to learn how to design an architecture from scratch, so I am experimenting with that after carefully reading the answers you provided. I thank each of you so much and wish you all the success in your careers. You are great people and we are a great community.

2

hjups22 t1_j3l3ln2 wrote

Those are all going to be pretty small models (under 200M parameters), so what I said probably won't apply to you. Although, I would still recommend parallel training rather than trying to link them together (4 GPUs means you can run 4 experiments in parallel - or 8 if you double up on a single GPU).
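For example, a minimal sketch of that kind of parallel setup: one independent run pinned to each GPU via CUDA_VISIBLE_DEVICES (the train.py script and config files are hypothetical placeholders):

```python
import os
import subprocess

# One hypothetical experiment config per GPU.
experiments = ["exp_a.yaml", "exp_b.yaml", "exp_c.yaml", "exp_d.yaml"]

procs = []
for gpu_id, config in enumerate(experiments):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))  # pin this run to one GPU
    procs.append(subprocess.Popen(["python", "train.py", "--config", config], env=env))

for p in procs:
    p.wait()  # block until all four runs finish
```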

Regarding RAM speed: it has an effect, but it probably won't be all that significant given your planned workload. I recently changed the memory on one of my nodes so that it could train GPT-J (I reduced the RAM speed so that I could increase the capacity); the speed difference for other tasks is probably within 5%, which I don't think matters (when you expect to run month-long experiments, an extra day is irrelevant).

2

hjups22 t1_j3l1l6n wrote

That information is very outdated, and also not very relevant...
The 3090 is an Ampere card with 2x faster NVLink, which gives it a significant speed advantage over the older GPUs. I'm not aware of any benchmarks that explicitly tested this, though.

Also, Puget benchmarked what I would consider "small" models. If the model is small enough, then the interconnect won't really matter all that much, since you're going to spend more time in communication setup than in the actual transfer.
But for the bigger models, you'd better bet it matters!
Although, to be fair, my original statement was based on a node with 4x A6000 GPUs configured with pair-wise NVLink. When you jump from 2 paired GPUs to 4 GPUs with batch-parallelism, the training time (for big models, ones which barely fit in the 3090) only increases by about 20% rather than the expected 80%.
It's possible that the same scaling will not be seen on 3090s, but I would expect the scaling to be worse in the system described by the OP, since the 4x system allocated a full 16 lanes to each GPU via dual sockets.

Note that this is why I asked about the type of training being done: if the models are small enough (like ResNet-50), then it won't matter, though ResNet-50 training is pretty quick and won't really benefit that much from multiple GPUs in the grand scheme of things.
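For context, a bare-bones sketch of what batch-parallelism across the GPUs looks like (single-node PyTorch DistributedDataParallel, with a stand-in model and synthetic data rather than anything from this thread):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def worker(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Stand-in model; each rank holds a replica and gradients are all-reduced.
    model = DDP(torch.nn.Linear(512, 10).cuda(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # Synthetic data; the DistributedSampler gives each GPU a disjoint shard.
    dataset = TensorDataset(torch.randn(4096, 512), torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    for x, y in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x.cuda(rank)), y.cuda(rank))
        loss.backward()   # gradient all-reduce across GPUs happens here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # e.g. 4x 3090
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```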

4

VinnyVeritas t1_j3l04w8 wrote

3

qiltb t1_j3kjvki wrote

It actually works very well with an ADD2PSU connector (I used something like 5 PSUs for one 14x3090 rig). He should actually think more about getting a 1600W HIGH QUALITY PSU.

The Corsair RM series IS NOT SUITABLE for the workload you are looking at. Preferably use the AXi series, or the HXi if you really want to cheap out. We are talking about really abusing those PSUs. The AX1600i is still unmatched for this use case.

1

hjups22 t1_j3k2kei wrote

What is the intended use case for the GPUs? I presume you intend to train networks, but which kind and at what scale? Many small models, or one big model at a time?

Or if you are doing inference, what types of models do you intend to run?

The configuration you suggested is really only good for training / inferencing many small models in parallel, and will not be performant for anything that uses more than 2 GPUs via NVLink.
Also don't forget about system RAM... depending on the models, you may need ~1.5x the total VRAM capacity in system RAM, and DeepSpeed requires a lot more than that (upwards of 4x). I would probably go with at least 128GB for the setup you described.

6