Recent comments in /f/MachineLearning
farmingvillein t1_jdnuvnf wrote
Reply to comment by Disastrous_Elk_6375 in [D] Do we really need 100B+ parameters in a large language model? by Vegetable-Skill-9700
> I believe you might have misunderstood the claims in Alpaca. They never stated it is as capable as ChatGPT, they found (and you can confirm this yourself) that it accurately replicates the instruction tuning. That is, for most of the areas in the fine-tuning set, a smaller model will output in the same style of davinci.
This is a misleading summary of the paper.
They instruction tune and then compare Alpaca against GPT-3.5, and say that Alpaca is about equal on the tasks compared (which, to be clear, is not equivalent to a test of "broad capability").
Yes, you are right that they don't make a categorical claim that it is as capable as ChatGPT, but they do state that their model is approximately as capable as GPT-3.5 (which is of course not a 1:1 with ChatGPT) on the diverse set of tasks tested.
It is very much not just a paper showing that you can make it output in the same "style".
farmingvillein t1_jdntw7b wrote
Reply to comment by Sorry-Balance2049 in [D] Do we really need 100B+ parameters in a large language model? by Vegetable-Skill-9700
Pure marketing.
Not even weights... due to the ToS issues with the fine-tune set, presumably.
SeymourBits t1_jdnttwn wrote
Reply to [D] What happens if you give as input to bard or GPT4 an ASCII version of a screenshot of a video game and ask it from what game it has been taken or to describe the next likely action or the input? by Periplokos
Interesting experiment. I haven't done it, but I predict it would hallucinate a well-documented video game screen, like Pac-Man, and then describe probable actions within the hallucinated game.
light24bulbs t1_jdntdbb wrote
Reply to comment by baffo32 in [R] Hello Dolly: Democratizing the magic of ChatGPT with open models by austintackaberry
I'm not hoping to do instruction tuning; I want to do additional pre-training.
I_will_delete_myself t1_jdnrr46 wrote
Reply to comment by Crystal-Ammunition in [D] Do we really need 100B+ parameters in a large language model? by Vegetable-Skill-9700
At that point we will run out of data. It will require more data-efficient methods.
fiftyfourseventeen t1_jdnqlqc wrote
Reply to comment by nixed9 in [D] I just realised: GPT-4 with image input can interpret any computer screen, any userinterface and any combination of them. by Balance-
That's really cool, but it's published by Microsoft, which is working with OpenAI, and it's a commercial closed-source product. It's in their best interest to brag about its capabilities as much as possible.
There are maybe sparks of AGI, but there are a lot of problems that people have been trying to solve for decades and that are still going to be very difficult to solve.
baffo32 t1_jdnppmp wrote
Reply to comment by light24bulbs in [R] Hello Dolly: Democratizing the magic of ChatGPT with open models by austintackaberry
This is the same task as instruction tuning; instruction tuning just uses specific datasets where instructions are followed. It's still called "fine-tuning", but nowadays people are using adapters and PEFT to do this on low-end systems.
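A rough sketch of what that looks like with the HuggingFace peft library (LoRA adapters); the base model and hyperparameters here are just placeholders, not anyone's exact setup:

```python
# Sketch: attach LoRA adapters to a causal LM so only a small number of
# parameters are trained. Whether this is "instruction tuning" or continued
# pre-training depends purely on the data you feed it afterwards.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "gpt2"  # placeholder; swap in whatever base model you're actually using
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's attention projection; model-specific
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the adapter weights are trainable

# From here, train as usual (Trainer or a plain PyTorch loop) on whatever
# text you want: plain corpora for continued pre-training, or instruction data.
```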
KarlKani44 t1_jdnosqr wrote
Reply to comment by Blutorangensaft in [D]: Different data normalisation schemes for GANs by Blutorangensaft
> Is the training curve you describe the only possible one for the critic loss?
Well, that's hard to say. If it works I wouldn't say it's wrong, but it would still make me think. Generally, in the case of WGAN, it's always a bit hard to say whether the problem is an overly strong generator or an overly strong discriminator. With normal GANs, you can see that the discriminator differentiates very easily when you look at its accuracy. With WGANs you can look at the distribution of output logits from the critic for real and generated samples. If the two distributions are easily separable, the critic can still tell real from fake. During training, the distributions of output logits should converge to look the same for both sets of samples.
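Something like this is enough to eyeball the two logit distributions (a quick sketch, assuming you already have a `critic`, a `generator` and a batch of real samples at hand):

```python
import torch
import matplotlib.pyplot as plt

@torch.no_grad()
def plot_critic_logits(critic, generator, real_batch, latent_dim=128):
    # Score a real batch and an equally sized generated batch with the critic
    z = torch.randn(real_batch.size(0), latent_dim, device=real_batch.device)
    fake_batch = generator(z)
    real_logits = critic(real_batch).flatten().cpu().numpy()
    fake_logits = critic(fake_batch).flatten().cpu().numpy()

    # If the two histograms are easily separable, the critic is winning;
    # as training converges they should overlap more and more.
    plt.hist(real_logits, bins=50, alpha=0.5, label="real")
    plt.hist(fake_logits, bins=50, alpha=0.5, label="generated")
    plt.xlabel("critic output logit")
    plt.legend()
    plt.show()
```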
From my experience and understanding: you want a very strong discriminator in WGAN training, since the gradient of its forward pass will still be very smooth because of the Lipschitz constraint (enforced through the gradient penalty). This is also why you train it multiple times before each generator update. You want it to be very strong so the generator can use it as guidance. In vanilla GANs this would be a problem because the generator cannot keep up. This is also why WGANs are easier to train: you don't have to maintain that hard-to-achieve balance between the two networks.
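For reference, the gradient penalty that enforces the Lipschitz constraint is only a few lines in PyTorch (a generic sketch, not tied to any particular architecture):

```python
import torch

def gradient_penalty(critic, real, fake):
    # Random interpolation between real and generated samples
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(interp)

    # Gradient of the critic scores w.r.t. the interpolated inputs
    grads = torch.autograd.grad(
        outputs=scores, inputs=interp,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,
    )[0]
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)

    # Penalize deviation of the gradient norm from 1 (the WGAN-GP objective)
    return ((grad_norm - 1) ** 2).mean()
```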
If you look at the Keras tutorial on WGAN-GP, their critic has 4.3M parameters, while the generator only has 900k. A vanilla GAN would not converge with models like this because the discriminator would be too strong. Their critic loss also starts at -7 and goes down very smoothly from there.
> Could this mean that the generator's job became easier due to normalisation?
I would agree with this hypothesis. I'd say your critic is not able to properly tell the real samples from the generated ones right at the beginning; probably the normalization helped the generator more than the critic. Try to make the critic stronger by scaling up the network or training it more often before updating the generator, and see if the critic loss then starts at negative values. Also try the aforementioned plot of the critic's output logits to see whether the critic can separate real from fake at early epochs.
I haven't used scheduling with GANs before, but it might help. I would still try to get a stable training run with nice-looking output first and then try more tricks like scheduling and TTUR. With Adam I usually don't do any tricks like this, though.
[deleted] t1_jdnmu4i wrote
Reply to comment by alrunan in [D] Do we really need 100B+ parameters in a large language model? by Vegetable-Skill-9700
[deleted]
MrFlamingQueen t1_jdnmkby wrote
Reply to comment by drinkingsomuchcoffee in [D] Do we really need 100B+ parameters in a large language model? by Vegetable-Skill-9700
🤫🤫 Shhhhh, this is my research area.
machineko t1_jdnmg8l wrote
Reply to comment by ephemeralentity in [R] Hello Dolly: Democratizing the magic of ChatGPT with open models by austintackaberry
Right, 8GB won't be enough for LLaMA 7B. You should try a GPT-2 model. That should work on 8GB VRAM.
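Something like this should run comfortably on 8GB (a minimal sketch using the standard HuggingFace setup with the small 124M-parameter "gpt2" checkpoint):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")

inputs = tokenizer("The meaning of life is", return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```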
pornthrowaway42069l t1_jdnmf0j wrote
Reply to comment by currentscurrents in [N] GPT-4 has 1 trillion parameters by mrx-ai
I'm confused, how is that different from what I said? Maybe I worded my response poorly, but I meant that we should focus on smaller models, rather than those gigantic ones.
nixed9 t1_jdnm1qx wrote
Reply to comment by fiftyfourseventeen in [D] I just realised: GPT-4 with image input can interpret any computer screen, any userinterface and any combination of them. by Balance-
> Sparks of Artificial General Intelligence: Early experiments with GPT-4
No_Confusion_5493 t1_jdnksi8 wrote
Great great and great thanks for this post
learn-deeply t1_jdnkaw7 wrote
Reply to comment by nekize in [R] Reflexion: an autonomous agent with dynamic memory and self-reflection - Noah Shinn et al 2023 Northeastern University Boston - Outperforms GPT-4 on HumanEval accuracy (0.67 --> 0.88)! by Singularian2501
If you need to pad your paper, that means there hasn't been enough original research done.
Blutorangensaft OP t1_jdnjfmu wrote
Reply to comment by KarlKani44 in [D]: Different data normalisation schemes for GANs by Blutorangensaft
Thank you for the thorough answer. 1) I see, I will just trust my normalisation scheme then. 2) That makes sense. 3) Is the training curve you describe the only possible one for the critic loss? Because, with normalisation, I see the critic loss approaching 0 from a positive value. Could this mean that the generator's job became easier due to normalisation? Does it make sense to think about improving the critic then (like you described, with 3 times the params)? Also, I read about and tried scheduling, but I am using TTUR instead for its better convergence properties.
Impressive-Ad6400 t1_jdnjakm wrote
Uhm, this is probably incorrect as an analogy, but do we humans actually need those ~86 billion neurons in our brains?
I mean, there are lots of people who have lost a brain hemisphere for different reasons, and yet, they live happy lives.
However, what they lose is flexibility. This means they have a hard time when faced with new situations and have difficulty adapting to them.
I can't be certain, but it's possible that the number of parameters in large language models accounts for their flexibility. That is why you can throw anything at ChatGPT and it will answer, within the scope given by its restrictions.
I'm not sure either whether enlarging the number of parameters will give us emergent properties or whether it will only slow down data processing. Blue whales have immense brains, but they aren't necessarily smarter than us, partly because a larger brain means longer distances for neurons to connect, slower response times, and increased energy expenditure.
I could be wrong, though. Electronic brains don't have the same limitations as physical brains, so maybe increasing their size won't affect their output.
drinkingsomuchcoffee t1_jdnhxri wrote
Huge models are incredibly wasteful and unoptimized. Someday, someone is going to sit down and create an adaptive algorithm that expands or contracts a model during the training phase and we're going to laugh at how stupid we were.
fiftyfourseventeen t1_jdnhbn0 wrote
Reply to comment by Yardanico in [D] Do we really need 100B+ parameters in a large language model? by Vegetable-Skill-9700
OpenAI is also doing a lot of tricks behind the scenes, so it's not really fair to just type the same thing into both; they are getting nowhere near the same prompt. LLaMA is promising, but it just needs to be properly instruction tuned.
fiftyfourseventeen t1_jdngwum wrote
Reply to comment by wrossmorrow in [D] Do we really need 100B+ parameters in a large language model? by Vegetable-Skill-9700
Eh... not really, that's training a low-rank representation of the model, not actually making it smaller.
KarlKani44 t1_jdng4it wrote
1. Normalization is much older than GANs and I don't think there are papers that investigate its effect specifically for GANs. To find papers that generally look into the effect of normalization, you would probably have to go back to papers from the '90s that experimented with small MLPs and CNNs. Normalization just helps with convergence in general, which is often problematic with GANs.
2. Normalization is not related to the activation function, since activations are applied after at least one linear function, which often includes a bias. This bias can easily shift your logits into any range, so the initial normalization doesn't have an effect on this. In my experience, a well-designed GAN will converge with [-1, 1] just as well as with [0, 1], making barely any difference. Just make sure you don't train with very large values (like [0, 255] for pixels). If my data is already in a range like [0, 1.5], for example, I don't care about normalization that much.
3. The WGAN critic loss starts at a large negative value and converges to zero from there. See the paper "Improved Training of Wasserstein GANs", Figure 5(a), where they plot the negative critic loss. Depending on your batch size and steps per epoch, you might see a positive critic loss at the very beginning which quickly drops to a large negative value before it starts to converge to zero.
Usually you want your critic loss to converge to zero slowly. If it goes to zero very fast, it might still work, but your generated samples are probably not optimal. Generally I'd track the quality of the samples with an additional metric. For images you can use something like FID. For simpler data (or simple images like MNIST) there are also metrics like MMD that give you an idea of your sample quality, which you can again use to improve your training.
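If you want a quick-and-dirty MMD estimate, something like this is usually enough to track relative progress (a biased RBF-kernel estimator; the bandwidth `sigma` is something you pick by hand):

```python
import torch

def mmd2_rbf(x, y, sigma=1.0):
    # x: batch of real samples, y: batch of generated samples,
    # both flattened to vectors; sigma is the kernel bandwidth.
    x, y = x.flatten(1), y.flatten(1)
    k = lambda a, b: torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    # Biased estimate of MMD^2 = E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]
    return (k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()).item()
```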
WGANs often work better if the critic is bigger than the generator (around 3x the parameters, in my experience). If you think your networks are already well designed, the next thing I would play with is the number of critic updates done before each generator update. I've seen people go up to 25 with this number (the original paper uses 5, I think). The other hyperparameter I'd play with is the learning rate of Adam, though usually keeping it the same for generator and critic.
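As a sketch, that update schedule is just an extra condition in the training loop (names like `generator`, `critic`, `critic_opt`, `gen_opt`, `dataloader`, `latent_dim` and the `gradient_penalty` helper are assumed to exist already):

```python
import torch

n_critic = 5       # critic steps per generator step; people go as high as ~25
lambda_gp = 10.0   # gradient penalty weight from the WGAN-GP paper

for step, real in enumerate(dataloader):
    # ----- critic update (every step) -----
    z = torch.randn(real.size(0), latent_dim, device=real.device)
    fake = generator(z).detach()
    critic_loss = (critic(fake).mean() - critic(real).mean()
                   + lambda_gp * gradient_penalty(critic, real, fake))
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # ----- generator update (every n_critic steps) -----
    if step % n_critic == n_critic - 1:
        z = torch.randn(real.size(0), latent_dim, device=real.device)
        gen_loss = -critic(generator(z)).mean()
        gen_opt.zero_grad()
        gen_loss.backward()
        gen_opt.step()
```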
Daveboi7 t1_jdnf96e wrote
Reply to comment by dreamingleo12 in [R] Hello Dolly: Democratizing the magic of ChatGPT with open models by austintackaberry
Somebody posted results on Twitter and they looked pretty good. I don't think he worked for DB either. But who knows, really.
dreamingleo12 t1_jdnf4qn wrote
Reply to comment by Daveboi7 in [R] Hello Dolly: Democratizing the magic of ChatGPT with open models by austintackaberry
I don’t trust DB’s results tbh. LLaMA is a better model than GPT-J.
Daveboi7 t1_jdnf18o wrote
Reply to comment by dreamingleo12 in [R] Hello Dolly: Democratizing the magic of ChatGPT with open models by austintackaberry
Yeah but the comparisons I have seen between Dolly and Alpaca look totally different.
Somehow the Dolly answers look much better imo
Edit: spelling
Western-Image7125 t1_jdnvnu7 wrote
Reply to comment by Snoo58061 in [D] "Sparks of Artificial General Intelligence: Early experiments with GPT-4" contained unredacted comments by QQII
Well, your last line kinda makes the same point as the other person you are debating with? What if we are getting really close to actual intelligence, even though it is nothing like biological intelligence, which is the only kind we know of?