Recent comments in /f/deeplearning

peder2tm t1_j0xtfsu wrote

I have seen 10x RTX 3090 in a single rack-mounted server node with 2x 40-core Intel CPUs. This is a university setup; the nodes are connected with InfiniBand and managed with Slurm.
If you need to mount 10 RTX 3090s in the same node, you must get ones with blower-style fans to get the heat out, plus the most powerful case fans you can find.

1

sigmoid_amidst_relus t1_j0wsqyz wrote

A 3090 is not as good as an A100 in terms of pure performance.

It's much better than an A100 in perf/$, though.

A single consumer-grade deep learning node won't scale past 3x 3090s without diminishing returns unless all you work with are datasets that fit in memory, or you have a great storage solution. Top-end prosumer and server-grade platforms will do fine with up to 4-6 cards in a non-rack-mounted setting, but not without custom cooling. The problem isn't just how well you can feed the GPUs: 3090s simply aren't designed to run at the high densities that server cards are. That's why companies are happy to pay a pretty penny for A100s and other server-grade cards (even ignoring certification requirements and Nvidia's mandates): the infrastructure and running costs of a good-quality server facility, and the money lost to potential downtime, far outweigh the GPU costs.

Multi-node setups are connected through high-bandwidth interconnects, like Mellanox InfiniBand gear.
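For what it's worth, on the software side the interconnect mostly shows up through the collective-comms backend. A toy sketch (my own made-up example, assuming PyTorch with NCCL and a launcher like torchrun or Slurm setting the usual env vars):

    import os
    import torch
    import torch.distributed as dist

    # NCCL rides on whatever fabric links the nodes (InfiniBand, RoCE, plain
    # Ethernet) and handles the cross-node gradient all-reduce.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    model = torch.nn.Linear(512, 512).cuda()
    ddp_model = torch.nn.parallel.DistributedDataParallel(model)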

Most mining farms don't run GPUs on a full PCIe x16 link, since mining barely uses PCIe bandwidth, so you're not going to scale as easily as they do.

You can certainly scale to a 64-GPU "farm", but it's going to be a pain in a consumer-grade-only setup, especially in terms of interconnects, not to mention terribly space- and cooling-inefficient.

3

sayoonarachu t1_j0wjj7v wrote

You could probably look at the 11th-gen Legion 7i, which is cheaper than the new 12th-gen ones. They don't come with a 3080 Ti, but the difference between the 3080 and the 3080 Ti, last I checked, was very minimal, something like a 5% performance difference.

I personally have the 11th-gen version after comparing a bunch of gaming laptops, and I use it for programming in Unreal Engine, deep learning, playing with Stable Diffusion, etc. Main pro? Like you said, the looks. I love the simple, minimal, non-gaming-laptop appeal of the Legions. 😅

Also, you'd probably want to research whether all the laptops you've listed can actually run their 3080s at the max rating of 150W (previously known as Max-Q, I believe). Some OEMs won't advertise it. The Legion 7i 3080s can, though.

1

Logon1028 t1_j0vg2lk wrote

In theory, if the CNN needs an edge-detection filter, it will learn it through training the weights. Yes, adding known filters can sometimes improve performance if you know your dataset extremely well. But humans are honestly really bad at hand-programming complex detection tasks like these, and the network might not even need those known filters, at which point you're just wasting computation time. The majority of the time it's better to just let the network do its thing and learn the filters itself.
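For example, here's a toy sketch (my own made-up example; the Sobel kernel is the classic hand-designed edge detector) of the difference between a "known filter" and a learned one:

    import numpy as np
    from scipy.signal import convolve2d

    # A hand-designed "known filter": the Sobel kernel for vertical edges.
    sobel_x = np.array([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]])

    image = np.random.rand(32, 32)                   # stand-in for a real input
    edges = convolve2d(image, sobel_x, mode="same")  # fixed, hand-picked response

    # A conv layer's kernel starts out random; if an edge detector actually
    # lowers the loss, training will push the weights toward something
    # Sobel-like on its own, and if it doesn't, the hand-built filter was
    # wasted computation anyway.
    learned_kernel = np.random.randn(3, 3) * 0.1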

6

vortexminion t1_j0rkirh wrote

I'd go to Stack Overflow for help with stuff like this. Also, a screenshot alone is impossible to diagnose. You'll need to upload the error log, the code, and a detailed explanation of the problem if you want anyone to give you useful advice. I have no idea what your problem is because you've provided nothing that would help me narrow it down.

Afterthought: I think you mean "convolution".

1

Logon1028 OP t1_j0mcyle wrote

What I ended up doing is using np.indices (multiplied by the stride) to apply a mask to the x and y argmax arrays with an elementwise multiplication. Then I used elementwise division and modulus to calculate the input indices myself. The only for loop left in the forward pass is a simple one over the depth of the input. The backward pass still uses a triple for loop, but I can live with that.
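Roughly, the idea looks like this (a simplified sketch with made-up names, not the exact code; it assumes a (depth, height, width) input and square pooling windows):

    import numpy as np
    from numpy.lib.stride_tricks import sliding_window_view

    def maxpool_forward(x, pool=2, stride=2):
        # x: (depth, height, width). Carve the pooling windows out as a strided view.
        windows = sliding_window_view(x, (pool, pool), axis=(1, 2))[:, ::stride, ::stride]
        d, out_h, out_w = windows.shape[:3]
        flat = windows.reshape(d, out_h, out_w, pool * pool)

        out = flat.max(axis=-1)
        argmax = flat.argmax(axis=-1)              # flat index inside each window

        # division/modulus give the (row, col) offset inside each window;
        # np.indices * stride adds each window's top-left corner in the input
        off_r, off_c = argmax // pool, argmax % pool
        base_r, base_c = np.indices((out_h, out_w)) * stride
        in_rows = base_r + off_r                   # (depth, out_h, out_w) row indices into x
        in_cols = base_c + off_c
        return out, in_rows, in_cols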

The model I showed in the previous comment now trains in just under 4 minutes, so I now have roughly a 3x performance increase over my original implementation. I think that is where I am going to leave it.

Thank you for your help. Even though I didn't use all of your suggestions directly, they definitely guided me in the right direction. Unfortunately, my current implementation is FAR more efficient than any of the examples I could find online.

1

Logon1028 OP t1_j0mbuxc wrote

Yes, but np.unravel_index has to be applied to EVERY SINGLE ELEMENT of the last axis independently, i.e.

        # convert each window's flat argmax back into a (row, col) offset, one window at a time
        for depth in range(strided_result.shape[0]):
            for x in range(strided_result.shape[1]):
                for y in range(strided_result.shape[2]):
                    local_stride_index = np.unravel_index(argmax_arr[depth][x][y], strided_result[depth][x][y].shape)

np.unravel_index only takes a 1D array of flat indices as input. In order to apply it to only the last axis of the 4D array, you have to use a for loop; np.unravel_index has no axis parameter.

1

elbiot t1_j0m3a67 wrote

Huh?

    idx = np.unravel_index(indices, shape)
    values = arr[idx]

No loop required. If you're referring to the same loop you were using to get the argmax, you can just adjust your indices first so they apply to the unstrided array.
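E.g., with made-up shapes (every window has the same (pool, pool) shape, so one call covers the whole argmax array at once):

    import numpy as np

    pool = 2
    # flat argmax within each (pool, pool) window, one per (depth, x, y) position
    argmax_arr = np.random.randint(0, pool * pool, size=(8, 13, 13))

    # np.unravel_index accepts an n-d array of flat indices -- no loop needed
    row_off, col_off = np.unravel_index(argmax_arr, (pool, pool))
    # row_off and col_off come back with the same (depth, x, y) shape as argmax_arr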

1

Logon1028 OP t1_j0lxkae wrote

That's what I am doing currently. But I have to unpack it in a triple nested for loop because numpy doesn't accept tuples there, so I don't get the benefit of numpy's vectorization, which is why I was searching for a possible alternative. I'm not trying to super-optimize this function, but I want all the low-hanging fruit I can get. I want people to be able to use the library to train small models for learning purposes.

1