Recent comments in /f/MachineLearning

farmingvillein t1_jdnuvnf wrote

> I believe you might have misunderstood the claims in Alpaca. They never stated it is as capable as ChatGPT, they found (and you can confirm this yourself) that it accurately replicates the instruction tuning. That is, for most of the areas in the fine-tuning set, a smaller model will output in the same style of davinci.

This is a misleading summary of the paper.

They instruction-tune and then compare Alpaca against GPT-3.5, and find that Alpaca is roughly on par on the tasks compared (which, to be clear, is not equivalent to a test of "broad capability").

Yes, you are right that they don't claim it is categorically more capable than ChatGPT, but they do state that their model is approximately as capable as GPT-3.5 (which is of course not a 1:1 match for ChatGPT) on the diverse set of tasks tested.

It is very much not just a paper showing that you can make it output in the same "style".

4

fiftyfourseventeen t1_jdnqlqc wrote

That's really cool, but I mean, it's published by Microsoft, which is working with OpenAI, and it's a commercial closed-source product. It's in their best interest to brag about its capabilities as much as possible.

There are maybe sparks of AGI, but there are a lot of problems that are going to be very difficult to solve that people have been trying to solve for decades.

0

KarlKani44 t1_jdnosqr wrote

> Is the training curve you describe the only possible one for the critic loss?

Well, that's hard to say. If it works I wouldn't say it's wrong, but it would still make me think. Generally, in the case of WGAN, it's always a bit hard to say whether the problem is an overly strong generator or an overly strong discriminator. With normal GANs, you can see that the discriminator differentiates very easily when you look at its accuracy. With WGANs you can look at the distribution of output logits from the critic for real and generated samples. If the two distributions are easily separable, the critic can still tell real samples from fake ones. During training, the logit distributions for both datasets should converge until they look the same.
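A toy sketch of that logit-distribution check, using synthetic logit values (the helper name and numbers are purely illustrative, not from any real training run):

```python
import statistics

def logit_separation(real_logits, fake_logits):
    """Crude separability measure: gap between the mean critic logits
    for real and fake samples, in pooled standard deviations.
    A large value means the critic still tells real from fake easily;
    near zero means the two distributions have merged."""
    gap = statistics.mean(real_logits) - statistics.mean(fake_logits)
    pooled = (statistics.stdev(real_logits) + statistics.stdev(fake_logits)) / 2
    return gap / pooled

# Hypothetical logits early in training: clearly separated
early = logit_separation([2.1, 1.8, 2.4, 1.9], [-1.7, -2.2, -1.9, -2.0])

# Later in training: the distributions overlap
late = logit_separation([0.2, -0.1, 0.3, 0.0], [0.1, -0.2, 0.2, -0.1])

assert early > late
```

In practice you would histogram the two sets of logits per epoch rather than reduce them to one number, but the idea is the same: watch the gap shrink as training converges.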

From my experience and understanding: you want a very strong discriminator in WGAN training, since the gradient of its forward pass will still be very smooth because of the Lipschitz constraint (enforced through the gradient penalty). This is also why you train it multiple times before each generator update. You want it to be very strong so the generator can use it as guidance. In vanilla GANs this would be a problem because the generator cannot keep up. This is also why WGANs are easier to train: you don't have to maintain the hard-to-achieve balance between the two networks.
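A minimal numeric sketch of the gradient-penalty idea: a toy 1-D "critic" with a finite-difference gradient standing in for autograd (all names are illustrative; a real WGAN-GP implementation computes the penalty on interpolates between real and fake batches via backprop):

```python
import random

def critic(x):
    return 3.0 * x  # toy critic with slope 3: violates the 1-Lipschitz goal

def grad(f, x, eps=1e-5):
    """Central finite difference, standing in for autograd."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def gradient_penalty(real, fake, lam=10.0):
    """WGAN-GP penalty: push the critic's gradient norm toward 1
    at a random interpolate between a real and a fake sample."""
    t = random.random()
    x_hat = t * real + (1 - t) * fake
    g = grad(critic, x_hat)
    return lam * (abs(g) - 1.0) ** 2

gp = gradient_penalty(real=1.0, fake=-1.0)
# the toy critic's slope is 3 everywhere, so the penalty is 10 * (3 - 1)^2 = 40
assert abs(gp - 40.0) < 1e-2
```

The penalty is added to the critic loss, which is what keeps the critic's gradients smooth even when it is much stronger than the generator.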

If you look at the Keras tutorial on WGAN-GP, their critic has 4.3M parameters, while the generator only has 900k. A vanilla GAN would not converge with models like this because the discriminator would be too strong. Their critic loss also starts at -7 and decreases very smoothly from there.

> Could this mean that the generator's job became easier due to normalisation

I would agree with this hypothesis. I'd say your critic is not able to properly tell the real samples from the generated ones right at the beginning. Probably the normalization helped the generator more than the critic. Try making the critic stronger by scaling up the network or training it more often before each generator update, and see if the critic loss then starts at negative values. Also try the aforementioned plot of the critic's output logits to see whether the critic can separate real from fake in early epochs.

I haven't used scheduling with GANs before, but it might help. I would still try to get stable training with nice-looking outputs first and then try more tricks like scheduling and TTUR. With Adam I usually don't do any tricks like this, though.

2

Blutorangensaft OP t1_jdnjfmu wrote

Thank you for the thorough answer. 1) I see, I will just trust my normalisation scheme then. 2) That makes sense. 3) Is the training curve you describe the only possible one for the critic loss? Because, with normalisation, I see the critic loss approaching 0 from a positive value. Could this mean that the generator's job became easier due to normalisation? Does it make sense to think about improving the critic then (like you described, with 3 times the params)? Also, I read about and tried scheduling, but I am using TTUR instead for its better convergence properties.

1

Impressive-Ad6400 t1_jdnjakm wrote

Uhm, this is probably incorrect as an analogy, but do we humans actually need all of those ~86 billion neurons in our brains?

I mean, there are lots of people who have lost a brain hemisphere for different reasons, and yet, they live happy lives.

However, what they lose is flexibility. This means they have a hard time when faced with new situations and have difficulties adapting to them.

I can't be certain, but it's possible that the number of parameters in large language models accounts for their flexibility. That is why you can throw anything at ChatGPT and it will answer, within the scope given by its restrictions.

I'm not sure either if enlarging the number of parameters will give us emergent properties or if it will only slow down data processing. Blue whales have immense brains, but they aren't necessarily smarter than us. And this is because a larger brain means larger distances for neurons to connect, slower response times and increased energetic expenditure.

I could be wrong, though. Electronic brains don't have the same limitations of physical brains, so maybe increasing their size won't affect their output.

4

KarlKani44 t1_jdng4it wrote

  1. Normalization is much older than GANs and I don't think there are papers that investigate its effect specifically for GANs. To find papers that look into the effect of normalization in general, you would probably have to go back to papers from the '90s that experimented with small MLPs and CNNs. Normalization just helps with convergence in general, which is often problematic with GANs

  2. Normalization is not related to the activation function, since activations are applied after at least one linear function, which often includes a bias. This bias can easily shift your logits into any range, so the initial normalization doesn't affect this. In my experience, a well-designed GAN will converge with [-1, 1] just as well as with [0, 1], making barely any difference. Just make sure you don't train with very large values (like [0, 255] for pixels). If my data is, for example, already in a range like [0, 1.5], I don't worry about normalization that much.

  3. The WGAN critic loss starts at a large negative value and converges to zero from there. See the paper "Improved Training of Wasserstein GANs", Figure 5(a), where they plot the "negative critic loss". Depending on your batch size and steps per epoch, you might see a positive critic loss at the very beginning, which quickly drops to a large negative value before it starts converging to zero.

Usually you want your critic loss to converge to zero slowly. If it goes to zero very fast, training might still work, but your generated samples are probably not optimal. Generally I'd track sample quality with an additional metric. For images you can use something like FID. For simpler data (or simple images like MNIST) there are also metrics like MMD that give you an idea of sample quality, which you can again use to improve your training.
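As a rough sketch of the MMD idea on 1-D toy data (Gaussian kernel, biased estimate; the sample values are made up for illustration — real use would compare feature vectors of real vs. generated samples):

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

def mmd(xs, ys, sigma=1.0):
    """Squared maximum mean discrepancy (biased estimate) between two
    1-D samples under a Gaussian kernel. Lower = more similar."""
    def k(a, b):
        return sum(gaussian_kernel(x, y, sigma) for x in a for y in b) / (len(a) * len(b))
    return k(xs, xs) + k(ys, ys) - 2 * k(xs, ys)

real  = [0.1, 0.3, 0.2, 0.4]   # stand-in for real data
close = [0.15, 0.25, 0.35, 0.2]  # generated samples near the real distribution
far   = [5.0, 5.2, 4.9, 5.1]     # generated samples far from it

assert mmd(real, close) < mmd(real, far)
```

Tracking a number like this over epochs gives you a sanity check on sample quality that is independent of the critic loss.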

WGANs often work better if the discriminator is bigger than the generator (around 3x the parameters in my experience). If you think your networks are already well designed, the next thing I would play with is the number of critic updates before each generator update. I've seen people go up to 25 with this number (the original paper uses 5, I think). The other hyperparameter I'd play with is the Adam learning rate, usually keeping it the same for generator and critic.
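The critic-to-generator update schedule could be sketched like this (the update functions are placeholders returning dummy loss values; in a real WGAN they would be optimizer steps on the critic and generator):

```python
import random

def critic_step():
    return -abs(random.gauss(0, 1))   # stand-in for a critic loss value

def generator_step():
    return random.gauss(0, 1)         # stand-in for a generator loss value

N_CRITIC = 5  # critic updates per generator update (original WGAN uses 5)

def train_epoch(n_batches=100):
    critic_losses, gen_losses = [], []
    for _ in range(n_batches):
        for _ in range(N_CRITIC):            # train the critic more often...
            critic_losses.append(critic_step())
        gen_losses.append(generator_step())  # ...then one generator update
    return critic_losses, gen_losses

c, g = train_epoch()
assert len(c) == N_CRITIC * len(g)
```

Raising `N_CRITIC` is the cheap knob to try before touching learning rates: it strengthens the critic without changing either architecture.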

3