Recent comments in /f/deeplearning

lazazael t1_j0fr6zz wrote

If not the cloud because you don't want ongoing payments, then remote compute on a desktop with 256 GB RAM and a 4090 heating the office. An instance like that is a beast for ML compared to these... slim contenders. University professors usually do that: they buy a heavy lifter for a few of them to use freely from their MacBook Airs or ThinkPads.

1

elbiot t1_j0fitwk wrote

Can't you just reshape the array and use argmax (so no as_strided)? Reshaping is often free. You'd have to do some arithmetic to get the indices back in the original shape, but that would be just one operation.

I.e. you can take a shape (99,) array, reshape it to (3, 33), and then take the argmax along the first axis to get 33 maxes.
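A minimal numpy sketch of that idea (the (99,) → (3, 33) shapes are just the example above, not the original data):

```python
import numpy as np

# Toy stand-in for the flat array.
x = np.random.default_rng(0).permutation(99).astype(float)

blocks = x.reshape(3, 33)        # usually a free view, no copy
rows = blocks.argmax(axis=0)     # 33 argmaxes, one per column
cols = np.arange(33)

# One extra arithmetic step maps back to indices in the original flat array.
flat_idx = rows * 33 + cols
assert np.array_equal(x[flat_idx], blocks.max(axis=0))
```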

1

BrotherAmazing t1_j0fai9p wrote

In this case, I don’t think anyone can tell you wtf is going on without a copy of your code and dataset. There are just so many unknowns, but is this 1000 dim dense layer the last layer before a softmax?

Are you training the other layers then adding this new layer with new weight initialization in between the trained layers, or are you adding it in as a new architecture and re-initializing the weights everywhere and starting from scratch again?

5

100drunkenhorses t1_j0eir9u wrote

I enjoy this type of build. Depending on your workspace, they sell extruded aluminum frames meant for GPU mining. They've got room for eATX mobos and space for 2 big PSUs. The cards are spaced far enough apart that if you get the 3000 RPM Noctua industrial fans and line them up, you can cool that many 3090s on a single rig. That's assuming you're willing to cough up enough for 6 PCIe 4.0 x16 risers. Remember they are finicky at best, so make sure you keep your warranty papers.

1

suflaj t1_j0dt970 wrote

Not really. 950 is smaller than 1000, so not only are you destroying information, but you are also potentially getting into a really bad local minimum.

When you add that intermediate layer, what you are essentially doing is random hashing your previous distribution. If your random hash kills the relations between data your model learned, then of course it will not perform.

Now, because Xavier and Kaiming-He initializations aren't exactly designed to act like a universal random hash, they might not kill all your relations, but they are still random enough to have that potential, depending on the task and data. You might get lucky, but on average, you will almost never get lucky.

If I were in your place, I would train with linear warmup to a fairly large learning rate, like 10x higher than your previous maximum. This will make very bad weights shoot out of their bad minima once the LR reaches the max, and hopefully you'll get better results once they settle down as the LR falls back down. Just make sure you clip your gradients so your weights don't go to NaN, because this is the equivalent of driving your car into a wall in hopes of the crash turning it into a Ferrari.
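A rough PyTorch sketch of that warmup-plus-clipping setup, assuming a toy model and synthetic data (the layer sizes, peak LR, warmup length, and clip norm are all placeholders, and any decay schedule after the warmup is left out):

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for the model and data (assumed shapes).
model = nn.Sequential(nn.Linear(1000, 950), nn.ReLU(), nn.Linear(950, 10))
train_loader = DataLoader(
    TensorDataset(torch.randn(512, 1000), torch.randint(0, 10, (512,))),
    batch_size=32,
)

# Peak LR set ~10x higher than the old maximum (placeholder value).
optimizer = optim.SGD(model.parameters(), lr=0.1)
warmup_steps = 100
scheduler = optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.01, end_factor=1.0, total_iters=warmup_steps
)

for epoch in range(10):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(x), y)
        loss.backward()
        # Clip so the large LR can't blow the weights up to NaN.
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()  # linear warmup toward the peak LR
```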

As for how long you should train it... Well, the best would be to add the layer without any nonlinear function and see how many epochs you need to reach the original performance. Since there is no nonlinear function, the new network is just as expressive as the original. Once you get the number of epochs, add like 25% to that number and train the one with the nonlinear transformation after your bottleneck for that long.
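A minimal sketch of the two variants being compared, with assumed sizes for the 1000 → 950 bottleneck:

```python
from torch import nn

# Stage 1: bottleneck with no nonlinearity, used only to measure how many
# epochs it takes to recover the original performance.
linear_probe = nn.Sequential(nn.Linear(1000, 950), nn.Linear(950, 1000))

# Stage 2: the actual bottleneck with a nonlinearity, trained ~25% longer
# than the epoch count found in stage 1.
real_bottleneck = nn.Sequential(nn.Linear(1000, 950), nn.ReLU(), nn.Linear(950, 1000))
```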

5

suflaj t1_j0drngf wrote

It should, but how much is tough to say, and it depends on the rest of the model and where this bottleneck is. If, say, you're doing this in the first layers, the whole model basically has to be retrained from scratch, and performance similar to the previous one is not guaranteed.

2

vade t1_j0dl85z wrote

I run 3x 3090 in a single case, without water cooling, but using one PCI riser and keeping the case open to allow for airflow. This is on a single 1600w PSU, no NVLink.

Anything more would be tough without a custom loop, and dual PSU.

Works great!

edit: I use a Fractal Design Define XL, and mount one 3090 FE vertically with a riser. It's janky but works.

1

Outrageous_Room_3167 OP t1_j0ddpsu wrote

>NVLink probably won't matter much since your CPU will be bottlenecked trying to send 5.2 TB/s of data to your GPUs. But again, there are no benchmarks to show how much, maybe the gains from NVLink will be noticeable.

I guess the bigger benefit of the NVLink is the larger memory, but aside from that, I don't think the performance gains are huge from what I've read. My thinking was to build out a chassis with external fans as well to cool everything down.

−1