Recent comments in /f/deeplearning
hjups22 t1_j3lk4e5 wrote
Reply to comment by qiltb in Building a 4x 3090 machine learning machine. Would love some feedback on my build. by soupstock123
Could you elaborate on what you mean by that?
The advantage of NVLink is gradient / weight communication, which is independent of image size.
qiltb t1_j3l9suz wrote
Reply to comment by hjups22 in Building a 4x 3090 machine learning machine. Would love some feedback on my build. by soupstock123
that also depends on input image size though...
qiltb t1_j3l9q8s wrote
Reply to comment by VinnyVeritas in Building a 4x 3090 machine learning machine. Would love some feedback on my build. by soupstock123
Well, even in basic tasks - like plain resnet100 classification training - using NVLink makes a huge difference.
qiltb t1_j3l9hll wrote
Reply to comment by soupstock123 in Building a 4x 3090 machine learning machine. Would love some feedback on my build. by soupstock123
Under full load, the AXi series is basically silent. But the main reason is that the RM series isn't high enough quality to actually sustain that load (even higher-grade PSUs like the EVGA P2 series have problems with the infamous 3090 under DL workloads). Also take a look at my big comment on this reddit post.
AKavun OP t1_j3l51kx wrote
Reply to comment by trajo123 in Why didn't my convolutional image classifier network learn anything! by AKavun
Thank you sir, I posted a general update to this thread and I will be further updating you about everything.
AKavun OP t1_j3l4gb1 wrote
u/trajo123 u/FastestLearner u/trajo123
I am giving this as a general update. In my original post, I said "I am doing something very obvious wrong" and indeed I was. The reason my model did not learn at all was that the whole Python script, with the exception of my main method, was being re-executed every few seconds, which caused my model to reinitialize and reset. I believe this was caused by PyTorch's handling of the "num_workers" parameter in the dataloader, which spawns worker processes that each re-import and re-execute the script.
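A minimal, self-contained sketch of that failure mode and its fix (the `square` function is a hypothetical stand-in for a training loop): with the "spawn" start method that PyTorch's DataLoader workers can use, every worker process re-imports the script, so anything outside the `__main__` guard runs once per worker.

```python
import multiprocessing as mp

print("module-level code running")  # executes again in every spawned worker

def square(x):
    return x * x

if __name__ == "__main__":
    # Only the parent process enters this block. Model construction and the
    # training loop should live here (or in a main() called from here), so
    # worker processes re-importing the script don't re-run them.
    with mp.get_context("spawn").Pool(2) as pool:
        print(pool.map(square, [1, 2, 3]))  # [1, 4, 9]
```

Moving model setup under the guard (or into a `main()`) is exactly the kind of change that stops the silent re-initialization described above.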
So fixing that allowed my model to learn, but it still performed poorly for the reasons all of you so generously explained in great detail. My first instinctive reaction was to switch to resnet18 and change the output layer. I also switched to cross-entropy loss once I learned I could still apply softmax in postprocessing to obtain the prediction confidence, something I previously did not think was possible. Now my model performs with 90% accuracy on my test set, and the rest I think is just tweaking the hyperparameters, enlarging and augmenting the data, and maybe doing some partial training with different learning rates.
However, I still want to learn how to design an architecture from scratch, so I am experimenting with that after carefully reading the answers you provided. I thank each of you and wish you all the success in your careers. You are great people and we are a great community.
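The softmax-as-postprocessing idea mentioned above, sketched in plain Python: PyTorch's CrossEntropyLoss consumes raw logits, so the model's head stays linear during training and softmax is applied only when you want confidences at inference time. (The logit values here are made up for illustration.)

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Raw logits from a classifier's final linear layer (no softmax in the model).
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)                              # sums to 1.0
pred = max(range(len(probs)), key=probs.__getitem__)  # argmax = class 0
print(pred, probs[pred])  # class index plus its confidence
```

The predicted class is the argmax either way; softmax only rescales the logits into probabilities you can report as confidence.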
rockpooperscissors t1_j3l41yd wrote
There are stats out there suggesting that NBA analysts are only correct about 70% of the time, so 75% accuracy seems good.
hjups22 t1_j3l3ln2 wrote
Reply to comment by soupstock123 in Building a 4x 3090 machine learning machine. Would love some feedback on my build. by soupstock123
Those are all going to be pretty small models (under 200M parameters), so what I said probably won't apply to you. Although, I would still recommend parallel training rather than trying to link them together (4 GPUs means you can run 4 experiments in parallel - or 8 if you double up on a single GPU).
Regarding RAM speed: it has an effect, but it probably won't be all that significant given your planned workload. I recently changed the memory on one of my nodes so that it could train GPT-J (reduced the RAM speed so that I could increase the capacity); the speed difference for other tasks is probably within 5%, which I don't think matters (when you expect to run month-long experiments, an extra day is irrelevant).
soupstock123 OP t1_j3l2srl wrote
Reply to comment by hjups22 in Building a 4x 3090 machine learning machine. Would love some feedback on my build. by soupstock123
Right now mostly CNNs, RNNs, and playing around with style transfers with GANs. Future plans include running computer vision models trained on videos and testing inferencing, but still researching how demanding that would be.
hjups22 t1_j3l1l6n wrote
Reply to comment by VinnyVeritas in Building a 4x 3090 machine learning machine. Would love some feedback on my build. by soupstock123
That information is very outdated, and also not very relevant...
The 3090 is an Ampere card with 2x faster NVLink, which has a significant advantage in speed compared to the older GPUs. I'm not aware of any benchmarks that explicitly tested this though.
Also, Puget benchmarked what I would consider "small" models. If the model is small enough, then the interconnect won't really matter all that much, as you're going to spend more time in communication setup than in transfer.
But for the bigger models, you'd better believe it matters!
Although to be fair, my original statement is based on a node with 4x A6000 GPUs, configured in a pair-wise NVLink configuration. When you jump from 2 paired GPUs over to 4 GPUs with batch-parallelism, the training time (for big models - ones which barely fit in the 3090) will only increase by about 20% rather than the expected 80%.
It's possible that the same scaling will not be seen on 3090s, but I would expect the scaling to be worse in the system described by the OP, since the 4x system allocated a full 16 lanes to each GPU via dual sockets.
Note that this is why I asked about the type of training being done, since if the models are small enough (like ResNet-50), then it won't matter - though ResNet-50 training is pretty quick and won't really benefit that much from multiple GPUs in the grand scheme of things.
soupstock123 OP t1_j3l0q8f wrote
Reply to comment by VinnyVeritas in Building a 4x 3090 machine learning machine. Would love some feedback on my build. by soupstock123
Yeah, that's basically what I've discovered too. The mobo with the 16 PCIe lanes isn't going to work out, so I changed my build to Threadripper. Any advice or suggestions for a PSU that can handle the workload?
soupstock123 OP t1_j3l0lhk wrote
Reply to comment by qiltb in Building a 4x 3090 machine learning machine. Would love some feedback on my build. by soupstock123
Thanks for the advice. Can you elaborate on why the Corsair RM series is not suitable for the workload? My rationale was that because it's an open-air mining frame instead of a case, I wanted the RM series, which is supposedly quieter.
VinnyVeritas t1_j3l0gqt wrote
Reply to Building a 4x 3090 machine learning machine. Would love some feedback on my build. by soupstock123
I don't know if that's going to work well with only 16 PCIe lanes; everyone I've seen building 4-GPU machines uses CPUs that have 48 or 64 PCIe lanes.
Also, you'll need a lot of watts to power that monster, not to mention a 10-20% margin if you don't want to fry the PSU.
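Back-of-envelope math for that margin (the per-component wattages here are assumptions for illustration, not the OP's measured numbers):

```python
# Rough sustained draw for a 4x 3090 build: ~350 W per GPU at stock,
# ~280 W for a HEDT CPU, ~100 W for drives, fans, and the board.
gpu_w = 4 * 350
cpu_w = 280
other_w = 100
total = gpu_w + cpu_w + other_w        # 1780 W sustained
with_margin = round(total * 1.2)       # the 20% headroom mentioned above
print(total, with_margin)              # 1780 2136
```

Under these assumptions even a single high-quality 1600 W unit falls short of the margin, which is consistent with the multi-PSU setups described elsewhere in this thread.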
soupstock123 OP t1_j3l0fmt wrote
Reply to comment by Volhn in Building a 4x 3090 machine learning machine. Would love some feedback on my build. by soupstock123
I plan on using this frequently, and compared to even renting a similar configuration online, I would break even after a year.
VinnyVeritas t1_j3l04w8 wrote
Reply to comment by hjups22 in Building a 4x 3090 machine learning machine. Would love some feedback on my build. by soupstock123
Each time someone asks this question, someone repeats this misinformed answer.
This is incorrect, NVLink doesn't make much difference.
soupstock123 OP t1_j3kzq2x wrote
Reply to comment by emanresuymsseug in Building a 4x 3090 machine learning machine. Would love some feedback on my build. by soupstock123
Damn, 1 lane each for the other 3 isn't enough for my needs. The bifurcation kinda sucks here. Thanks for the advice.
Volhn t1_j3kx4ut wrote
Reply to Building a 4x 3090 machine learning machine. Would love some feedback on my build. by soupstock123
Just get a single 3090 with a 7950X or a 13th-gen Intel chip, then spend the rest of what you would have spent on renting bigger GPUs in the cloud.
junetwentyfirst2020 t1_j3kqzbm wrote
Reply to comment by vagartha in Building an NBA game prediction model - failing to improve between epochs by vagartha
Every time! Training can take a long time, so I'd hate to walk away and come back the next day to see it stuck 😭 This will work even if your labels are incorrect.
vagartha OP t1_j3kmj7i wrote
Reply to comment by junetwentyfirst2020 in Building an NBA game prediction model - failing to improve between epochs by vagartha
That’s a really good idea lol. I’ll try that and get back to you.
As an aside, do you always do this to test your process?
junetwentyfirst2020 t1_j3kmct1 wrote
Have you tried overfitting on a single piece of data to ensure that your model can actually learn? You should be able to get effectively 100% accuracy when overfitting. If you can't do this, then you have a problem.
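The sanity check above, stripped down to pure Python so the mechanics are visible: fit a one-weight logistic model to a single labelled example by gradient descent and confirm the loss can be driven toward zero, i.e. the "model" can memorize one point. (All numbers here are arbitrary choices for the sketch.)

```python
import math

x, y = 2.0, 1.0   # one input, one positive label
w = 0.0           # a single weight, no bias

def forward(w, x):
    return 1.0 / (1.0 + math.exp(-w * x))  # sigmoid "model"

for _ in range(500):
    p = forward(w, x)
    grad = (p - y) * x   # d(binary cross-entropy)/dw for one example
    w -= 0.5 * grad      # plain gradient descent

p = forward(w, x)
print(p)  # approaches 1.0: the model overfits the single point
```

If even this kind of single-example loop can't push the prediction to the label, the problem is in the training plumbing (loss, gradients, data pipeline), not in the dataset.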
qiltb t1_j3kjvki wrote
Reply to comment by rikonaka in Building a 4x 3090 machine learning machine. Would love some feedback on my build. by soupstock123
It actually works very well with an ADD2PSU connector (I used about 5 PSUs for one 14x3090 rig). He should actually think about getting a 1600W HIGH QUALITY PSU.
The Corsair RM series IS NOT SUITABLE for the workload you are looking at. Use preferably the AXi series, or HXi if you really want to cheap out. We are talking about really abusing those PSUs. The AX1600i is still unmatched for this use case.
[deleted] t1_j3kix1w wrote
emanresuymsseug t1_j3kdzbv wrote
Reply to Building a 4x 3090 machine learning machine. Would love some feedback on my build. by soupstock123
> - with 4 gpus that's 4 PCIe4 lanes per gpu
With the Asus PRIME B650M-A AX you are looking at 16 lanes for 1 GPU and 1 lane each for the other 3 GPUs.
PCIEX16_2, PCIEX16_3 and PCIEX16_4 slots are electrically connected in x1 mode.
Bifurcation is only supported via PCIEX16_1 slot.
hjups22 t1_j3k2kei wrote
Reply to Building a 4x 3090 machine learning machine. Would love some feedback on my build. by soupstock123
What is the intended use case for the GPUs? I presume you intend to train networks, but which kind and at what scale? Many small models, or one big model at a time?
Or if you are doing inference, what types of models do you intend to run?
The configuration you suggested is really only good for training / inferencing many small models in parallel, and will not be performant for anything that uses more than 2 GPUs via NVLink.
Also don't forget about system RAM... depending on the models, you may need ~1.5x the total VRAM capacity in system RAM, and DeepSpeed requires a lot more than that (upwards of 4x). I would probably go with at least 128GB for the setup you described.
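Plugging the build above into those rules of thumb (assuming 4x 24GB for the 3090s; the multipliers are the ones quoted in the comment):

```python
# System-RAM sizing from the rules of thumb above.
total_vram = 4 * 24              # 96 GB of VRAM across the node
plain = 1.5 * total_vram         # ~144 GB for ordinary training
deepspeed = 4 * total_vram       # upwards of ~384 GB with DeepSpeed offload
print(plain, deepspeed)          # 144.0 384
```

So 128GB is a floor; a memory-hungry run against all four cards can want more, and DeepSpeed offloading changes the picture entirely.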
trajo123 t1_j3lwv4f wrote
Reply to comment by AKavun in Why didn't my convolutional image classifier network learn anything! by AKavun
> 90% accuracy in my test
Looking at accuracy can be misleading if your dataset is imbalanced. Let's say 90% of your data is labelled as False and only 10% of your data is labelled as True, so even a model that doesn't look at the input at all and just predicts False all the time will have 90% accuracy. A better metric for binary classification is the F1 score, but that also depends on where you set the decision threshold (the default is 0.5, but you can change that to adjust the confusion matrix). Perhaps the most useful metric to see how much your model learned is the Area under the ROC curve aka ROC_AUC score (where 0.5 is the same as random guessing and 1 is a perfect classifier).
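The imbalance pitfall described above, in numbers: a "classifier" that always predicts False on a dataset that is 90% False scores 90% accuracy while learning nothing.

```python
# 90 negative examples, 10 positive - and a model that ignores its input.
labels = [False] * 90 + [True] * 10
preds = [False] * 100

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(accuracy)  # 0.9, despite zero predictive power
```

Its ROC AUC, by contrast, would sit at the random-guessing level of 0.5, which is why AUC (or F1 at a chosen threshold) is the better health check here.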