Recent comments in /f/deeplearning

Ok_Firefighter_2106 t1_iy6t95h wrote

2,3

2: For example, suppose you use zero values for initialization. Due to the symmetric nature of a NN, all neurons become the same, and the multi-layer NN is then equivalent to a simple linear regression, since the NN fails to break the symmetry. Therefore, if the problem is non-linear, the NN just can't learn.

3: as explained in other answers.

1

nutpeabutter t1_iy3z9lc wrote

There is indeed a non-zero gradient. However, symmetric initialization introduces a plethora of problems:

  1. The only way to break the symmetry is through the random biases. A fully symmetric network effectively means that individual layers act as though they are a single weight (a 1-input, 1-output layer), so it cannot learn complex functions until the symmetry is broken. Learning will thus be highly delayed, as the network first has to break the symmetry before it can learn a useful function. This can explain the plateau at the start.
  2. Similar weights at the start, even if symmetry is broken, will lead to poor performance. It is easy to get trapped in local minima if your outputs are constrained because your weights do not have sufficient variance. There is a reason why weights are typically randomly initialized.
  3. Random weights also allow more "learning pathways" to be established: by pure chance alone, a certain combination of weights will be slightly more correct than others. The network can then exploit this to speed up its learning by changing its other weights to support these pathways. Symmetric weights do not possess such an advantage.
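
As a rough illustration of the symmetry point above, here is a minimal sketch in plain Python. The network shape (1 input, 3 tanh hidden units, 1 output, no biases), the toy target y = x**2, and the function name `train_hidden_weights` are all assumptions made for this example, not anything from the thread. Biases are left out on purpose so that nothing can break the symmetry.

```python
import math
import random

def train_hidden_weights(w_hidden, w_out, steps=200, lr=0.1):
    """Train a 1-input, tanh-hidden, 1-output net with per-sample gradient descent.

    Hypothetical toy setup: no biases, target y = x**2 on nine points in [-1, 1].
    """
    data = [(x / 4.0, (x / 4.0) ** 2) for x in range(-4, 5)]
    w_hidden, w_out = list(w_hidden), list(w_out)
    for _ in range(steps):
        for x, y in data:
            h = [math.tanh(w * x) for w in w_hidden]
            pred = sum(v * hi for v, hi in zip(w_out, h))
            err = pred - y  # derivative of 0.5 * (pred - y)**2 w.r.t. pred
            grad_out = [err * hi for hi in h]
            grad_hidden = [err * v * (1.0 - hi * hi) * x for v, hi in zip(w_out, h)]
            w_out = [v - lr * g for v, g in zip(w_out, grad_out)]
            w_hidden = [w - lr * g for w, g in zip(w_hidden, grad_hidden)]
    return w_hidden, w_out

# Symmetric start: every hidden unit gets the same weight.
sym_hidden, sym_out = train_hidden_weights([0.5] * 3, [0.5] * 3)

# Random start with a fixed seed, for comparison.
rng = random.Random(0)
rand_hidden, rand_out = train_hidden_weights(
    [rng.uniform(-1, 1) for _ in range(3)],
    [rng.uniform(-1, 1) for _ in range(3)],
)

print("symmetric init:", sym_hidden)  # the three entries stay identical
print("random init:   ", rand_hidden)
```

With the symmetric start every hidden unit receives exactly the same gradient at every step, so the three units never differentiate and the layer behaves like a single unit. The random start lets them move apart.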
6

canbooo t1_iy3pylo wrote

Bad initialization can be a problem if you do it yourself (i.e. bad scaling of weights) and if you are not using batch normalization or another kind of normalization, since it might make your neurons die. E.g. a tanh neuron with too large an input scale will only predict -1 or 1 for all data, which leaves it dead, i.e. not learning anything because the gradient is 0 for the entire data set.
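
To make the saturation argument concrete, here is a small sketch using only the standard library. The input values and weight scales are made up for illustration; the only fact relied on is that the derivative of tanh(x) is 1 - tanh(x)**2.

```python
import math

def tanh_grad(pre_activation):
    """Derivative of tanh at the given pre-activation: 1 - tanh(x)**2."""
    return 1.0 - math.tanh(pre_activation) ** 2

inputs = [-2.0, -0.5, 0.5, 2.0]  # hypothetical input values
for scale in (1.0, 10.0, 100.0):  # hypothetical weight scales
    grads = [tanh_grad(scale * x) for x in inputs]
    print(f"scale={scale:>5}: largest gradient over the data = {max(grads):.3e}")
```

As the scale grows, the largest gradient over the whole data set collapses toward zero, which is the "dead neuron" situation described above.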

6

Own-Archer7158 t1_iy3mec9 wrote

Note that the minimal loss is reached when the parameters make neural network predictions the closest to the real labels

Before that, the gradient is generally non-zero (except at a very unlucky local minimum)

To better understand the underlying optimization problem, consider linear regression with least-squares error as the loss (in one dimension it is a quadratic function to minimize, so there is no local minimum other than the global one)
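
A minimal sketch of that one-dimensional least-squares case, assuming a model y = w * x with no intercept and a handful of made-up data points. The helper names `mse_loss` and `mse_grad` are introduced here only for illustration.

```python
def mse_loss(w, xs, ys):
    """Mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_grad(w, xs, ys):
    """Derivative of mse_loss with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]  # made-up data roughly following y = 2x

# Closed-form minimizer of the quadratic loss: sum(x*y) / sum(x*x).
w_star = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

print("gradient at w = 0:     ", mse_grad(0.0, xs, ys))
print("gradient at w = w_star:", mse_grad(w_star, xs, ys))
```

Away from `w_star` the gradient is clearly non-zero, and at `w_star` it vanishes up to floating-point rounding. Because the loss is a parabola in `w`, that is the only stationary point.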

1

Own-Archer7158 t1_iy3m6j0 wrote

If all weights are the same (assume 0 for simplicity), then the output of the function/neural network is far from the objective/label

The gradient is therefore non-zero

And finally the parameters are updated: theta = theta - learning_rate*grad_theta(loss)

And when the parameters are updated the loss is changed

Usually, the parameters are randomly chosen

0

Own-Archer7158 t1_iy3h8pp wrote

If the learning rate is zero, the update rule of the params makes the params unchanged
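
A quick sketch of that update rule, assuming plain gradient descent on a toy squared-error loss. The target values and the helper names are hypothetical and only serve the example.

```python
def gradient_step(theta, grad, learning_rate):
    """One gradient descent update: theta - learning_rate * grad."""
    return [t - learning_rate * g for t, g in zip(theta, grad)]

def loss_grad(theta, target):
    """Gradient of the toy loss sum((theta_i - target_i)**2)."""
    return [2.0 * (t - y) for t, y in zip(theta, target)]

theta = [0.0, 0.0]
target = [1.0, -2.0]  # hypothetical values

grad = loss_grad(theta, target)
frozen = gradient_step(theta, grad, learning_rate=0.0)
moved = gradient_step(theta, grad, learning_rate=0.1)

print("gradient:         ", grad)
print("after step, lr=0:  ", frozen)  # same as theta
print("after step, lr=0.1:", moved)
```

The gradient is non-zero in both cases, but with a learning rate of zero the step multiplies it away, so the parameters (and therefore the loss) never change.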

Data balancing does not stop the loss from changing (it only affects overfitting), and the same goes for a regularization strength that is too low

Bad initialization is rarely a problem (with bad luck you could start right at a local minimum, but that is a rare event)

1

Rishh3112 t1_iy3dnqd wrote

Hey, I work for a start-up, and our AI models are trained on an AWS cloud system. I suggest getting an AWS server, since the company will need a hosting service later on and it's easier to keep and manage a cloud server. AWS security is pretty good, and hosting APIs is a lot easier with a cloud server. Training models is also a lot quicker, since building a physical system to match a standard AWS setup would be quite expensive: not just the hardware cost, but the power consumption would also be high. When comparing the cost of an AWS system with a physical system in the office, the power consumption alone offsets the monthly cost of a cloud system. Also, a cloud system can be accessed from any machine anywhere in the world, whereas for a physical system in the office you would need to set up a VPN for access, and the system has to stay powered on so you can use it whenever you want. An AWS server charges you only while you are using it, plus a minimal monthly charge. In comparison, a physical system is expensive up front, and electricity, the VPN, and the cooling systems and room will cost you more than an AWS server in the long run. Hope you found this helpful.

1

thefizzlee t1_iy38vtp wrote

I'm gonna assume the Nvidia A100 80GB edition is out of your budget, but that is the gold standard for machine learning. They're usually deployed in clusters of 8, but one is already better than 2 3090s for deep learning.

If, however, you want to choose between 2 3090s and a 4090 and you're running into VRAM issues, I'd go for the dual 3090s. Clustering GPUs is very well supported in machine learning, so you'll essentially be getting double the performance, and from what I know 1 4090 isn't faster than 2 3090s. Plus you'll double your VRAM.

Edit: if you want to save a buck and your software supports it, you could also look into the new Radeon RX 7900 XTX, as long as you don't need CUDA support or tensor cores

−1

--dany-- t1_iy29wqs wrote

I’m saying 2x 3090s are not much better than a 4090. According to Lambda Labs benchmarks, a 4090 is about 1.3 to 1.9 times faster than a 3090. If you’re after speed, then a 4090 definitely makes more sense, as it’s only slightly slower but much more power efficient and cheaper than 2x 3090s.
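
The arithmetic behind that comparison can be sketched as follows. It uses only the 1.3x to 1.9x range quoted above and assumes, optimistically, that two 3090s scale perfectly to 2x a single card. The function name and the `dual_scaling` parameter are made up for this example.

```python
def speed_of_4090_vs_dual_3090(speedup_over_one_3090, dual_scaling=2.0):
    """Speed of one 4090 as a fraction of two 3090s.

    dual_scaling is an assumption: 2.0 means the pair scales perfectly.
    """
    return speedup_over_one_3090 / dual_scaling

for speedup in (1.3, 1.9):  # range quoted in the comment above
    ratio = speed_of_4090_vs_dual_3090(speedup)
    print(f"4090 at {speedup}x a 3090 -> {ratio:.2f}x the speed of 2x 3090s")
```

Under perfect scaling the 4090 lands between roughly 0.65x and 0.95x of the dual setup. Real multi-GPU scaling is usually below 2x, which would narrow the gap further.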

4

somebodyenjoy OP t1_iy23k2m wrote

I do hyperparameter tuning too, so the same model will have to train multiple times. The more runs the better, as I can try more architectures. So speed is important. But you’re saying that the 4090 is not much better than the 3090 in terms of speed, huh

1

--dany-- t1_iy1zq6n wrote

The 3090 supports an NVLink bridge that connects two cards and pools their memory. Theoretically you’ll have 2x the computing power and 48GB of VRAM to do the job. If VRAM size is important for your big model and you have a beefy PSU, then this is the way to go. Otherwise just go with a 4090.

If you don’t need to train a model frequently, Colab or some paid GPU rental service might be easier on your wallet and power bill. For example, it’s only about $2 per hour to rent 4x RTX A6000 from some rentals.

5