Recent comments in /f/MachineLearning

matus_pikuliak OP t1_je0am88 wrote

Only some papers used few-shot prompting; it was usually beneficial, and sometimes it helped beat the SOTA.

Yeah, OpenAI definitely does not care about these benchmarks, but I think they are still useful for seeing how capable the models are. I find it hard to imagine the models being used in some applications if they cannot reliably do even the simple tasks these benchmarks evaluate.

1

TehDing t1_je08lpg wrote

You can ask GPT to spell a word, or provide the words as individual "S P A C E D" characters, and it will do similarly poorly - it has nothing to do with tokenization. GPT is capable of spelling; it can even identify that it is not playing well if you ask whether something is a good guess, but it continues to give poor answers.

In terms of 'solving' a game like this 20-questions example, there are only 12,000 valid words to guess from, or at worst 26^5 possible answers, which still makes the search space smaller than (or at worst on par with) the blog experiment's.

Want an easier game? It sucks at Hangman too. It'll guess by letter frequency, but not well enough to piece a word together. Even guessing on the basis of common n-grams would probably be a good enough strategy.
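To be concrete about how low that bar is, a toy frequency-based guesser is only a few lines. This is just a sketch; the candidate word list and the pattern format are assumptions on my part:

```python
# Toy letter-frequency Hangman guesser - the kind of simple baseline
# strategy described above. `words` is any dictionary list you supply.
from collections import Counter

def filter_candidates(words, pattern, guessed):
    """Keep words consistent with a revealed pattern like '_a__e'.
    `guessed` is the set of letters tried so far."""
    return [
        w for w in words
        if len(w) == len(pattern)
        and all(
            (c == p) if p != "_" else (c not in guessed)
            for p, c in zip(pattern, w)
        )
    ]

def next_guess(candidates, guessed):
    """Guess the unguessed letter that appears in the most candidates."""
    counts = Counter(
        letter
        for word in candidates
        for letter in set(word)  # count each letter once per word
        if letter not in guessed
    )
    return counts.most_common(1)[0][0] if counts else None
```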

My experience is that LLMs are poor at novel reasoning. This makes sense: RLHF isn't giving these things a consciousness. Maybe with tweaks/tools we'll actually see some "thinking", but for now (this may change next week at the rate things are going) they're not very good at games in general as a result (another example: I haven't tried it with GPT4, but GPT3 cheats at chess).

1

Impallion t1_je07in1 wrote

I completely agree, and of the things that u/nxqv listed, I think impact is the one most everyday people want and fear they will no longer have, more so than fame, riches, clout, etc. It's totally natural to want the things you spend effort on to have impact.

Now what I'm more interested in is the question of how much impact is enough to make you feel satisfied, and I think this is where the FOMO starts to set in. People want to have a "large" impact: making company-wide differences, influencing large swaths of people. I think the fear is that in the face of a ChatGPT, your little model or little application can only reach a handful of others.

Extrapolate current trends and you might think: oh well, AI applications are just going to get bigger and bigger. Midjourney 5 or SuperChatGPT-12 will be so insanely capable that we will have no more use for human writing, human art, human music, human programming. There will simply be no more room for my work to EVER have a big impact in the future. (Maybe this change is also similar to how the scientific greats back in the day could discover big theories like Einstein's relativity, but nowadays you need to hyper-specialize in academia to produce results for your tiny corner.)

My solution is that we need to dig a little deeper. What does it mean to be human? What does it mean to live a good, meaningful life? If your answer is that a life worth living is one where your impact reaches thousands or millions of humans, then yes, we might be shifting away from that possibility. But humans are built for connection, and I think we will need to look inwards and realize that we don't need to influence thousands to experience it. You can make a little model or application that affects hundreds. You can write a song just for your friends and family. You can paint a piece of art that just hangs on your wall and gets a single compliment. To me that is already human connection, and it is just as meaningful as building a large model that drives the next Google/Meta forward.

2

machineko t1_je05orp wrote

I agree. While these giant centralized models are all over the news, there are ways to make smaller models much more efficient (e.g. the LoRA approach mentioned above). And in the process of working with these techniques, we can perhaps discover new methods and architectures.

We are working on an open-source project focused on making LLM fine-tuning simple, fast, and efficient: https://github.com/stochasticai/xturing.
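For anyone who wants to see what the LoRA idea looks like in code, here's a minimal sketch using the Hugging Face peft library (not our xturing API; the base model and hyperparameters are just illustrative):

```python
# Minimal LoRA fine-tuning setup via peft: only small low-rank adapter
# matrices are trained, not the full model. Names here are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "facebook/opt-350m"  # stand-in small base model
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

config = LoraConfig(
    r=8,                                  # rank of the update matrices
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% trainable
```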

OP, we still have a ton of stuff we want to try out to make fine-tuning faster and more compute/memory efficient, if you are interested in contributing.

6

truchisoft t1_je05j63 wrote

The funny thing about these posts is that this is clearly propaganda aimed at low-effort people.

Anyone caring about this is either blinded by their own prejudice or just too dumb to even try GPT once themselves.

Everyone else does not need someone telling them that even GPT3.5 is incredible for coding (and a lot of other stuff). It is not perfect, but it goes a long way; heck, I was even able to make a simple game in less than 3 hours using 99% GPT3.5 code and DALL-E sprites.

−10

WokeAssBaller t1_je04bbu wrote

Reply to comment by lambertb in [D] GPT4 and coding problems by enryu42

I'm an MLE and I've used it a bunch; it's hardly ever actually useful. It gets close, but it's not there, and it's faster to Google almost every time.

It will probably be useful in a year or two, but it needs to understand how to run its own experiments. Anyone who actually thinks it is useful right now is just buying hype.

1

lambertb t1_je02zgn wrote

Have you used the tools yourself? I have, and a 40% increase in productivity is totally plausible, and often an underestimate considering I can now do things I would not have even tried previously. I encourage you to try them, with healthy skepticism and an open mind.

1

eamonious t1_je02go2 wrote

I work with a database that draws on experimental data on response times in human trials from Wash U in St. Louis. Alternatively, you can use the standardized grade-level vocab lists that exist in a number of states. Frequency data is also associated with these.

Obviously there's no true silver bullet for defining it, but I think we all have some intuition about, and recognize a degree of objectivity in, what a reasonably correct ordering of a random selection of words would look like, based on our understanding of language (including two-word terms like "ice cream", idiomatic phrases like "rain cats and dogs", or borrowed expressions like "deja vu" and "savoir faire"). Which in my mind means GPT should also be able to achieve that intuition. I encourage people to try this with GPT; in my experience it doesn't perform well, at least by any human-intuition standard.
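As a crude illustration of the frequency-based side of this, here's a sketch; the wordfreq package and the sample words are stand-ins I'm using, not the actual database:

```python
# Crude sketch: order words from "easy" to "hard" by corpus frequency.
from wordfreq import zipf_frequency

words = ["cat", "ice cream", "ephemeral", "savoir faire"]

# Higher Zipf frequency ~ more common ~ intuitively easier.
ordered = sorted(words, key=lambda w: zipf_frequency(w, "en"), reverse=True)
print(ordered)
```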

What's interesting to me is the possibility that the model ranks the "difficulty of words" as it itself experiences them: words that are, for whatever reason, more "difficult" for the model itself to assess.

Sorry, I’ll try to report back with something more concrete.

1

muskoxnotverydirty t1_je027xh wrote

Yeah it's speculation. I agree.

> There is no evidence that it was tested on training data, at this point.

I think what the author is trying to say is that for some of these tests there's no evidence it was tested on training data, but there's also no evidence that it wasn't, and the ability to generalize in the specific domain of the tests hinges on that difference. If nothing else, it would be nice for those who publish test results to state how much they know about whether test data was in the training data. It seems to me that they could automate a search within the training set to see if the exact wording is used.
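Even a toy exact-match check would be a start. This is only a sketch of the idea; the function names and the n-gram length are my assumptions, not anything the publishers have described:

```python
# Flag test examples whose long character n-grams appear verbatim
# in the training corpus - a crude exact-wording contamination check.
def char_ngrams(text: str, n: int = 50):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def flag_contaminated(test_examples, training_corpus: str, n: int = 50):
    flagged = []
    for example in test_examples:
        if any(gram in training_corpus for gram in char_ngrams(example, n)):
            flagged.append(example)
    return flagged
```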

11

nxqv t1_je00ks4 wrote

Also, don't lose sight of the forest because of a tree. We're talking about impact in the context of FOMO - if you feel that level of anxiety and rush about potentially missing out on the ability to make an impact because others are already making the impact you want to make, it's more likely to be ego-driven than genuine altruism

1

light24bulbs t1_jdzzeh4 wrote

Yes, I'm into it now. Code like this can be adapted to load bulk data instead of Q&A pairs.

I suspect some of the training parameters need to be adjusted a bit to prevent overfitting, and obviously the data loading and templating need to be removed.

https://github.com/lxe/llama-tune

Or, for a cooler approach where you make a LoRA layer: https://github.com/serp-ai/LLaMA-8bit-LoRA
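Roughly the kind of change I mean, sketched with a Hugging Face datasets pipeline (the checkpoint, file name, and max length are illustrative):

```python
# Load bulk raw text instead of Q&A pairs - no prompt templating,
# just tokenize the text into training examples.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # stand-in

raw = load_dataset("text", data_files={"train": "corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train = raw["train"].map(tokenize, batched=True, remove_columns=["text"])
```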

1

rshah4 t1_jdzyo8u wrote

Nice work! How were the results when comparing ChatGPT zero-shot versus few-shot? I have noticed that you can often get an improvement from few-shot learning (giving the model a few examples in the prompt).
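For anyone unfamiliar, the difference is purely in the prompt; a toy illustration (the task and examples are made up):

```python
# Zero-shot: the model gets only the task description.
zero_shot = "Classify the sentiment of this review: 'The service was slow.'\nSentiment:"

# Few-shot: the same task, preceded by a few worked examples.
few_shot = (
    "Classify the sentiment of each review.\n"
    "Review: 'Loved every minute.' Sentiment: positive\n"
    "Review: 'Total waste of money.' Sentiment: negative\n"
    "Review: 'The service was slow.' Sentiment:"
)
```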

I am not surprised that we don't see much of an improvement over GPT-3 on traditional NLP tasks. It seems much of OpenAI's focus is not on these benchmarks but on making the results more useful to people (all the instruction-tuning / RLHF work).

https://arxiv.org/pdf/2209.12356.pdf
https://arxiv.org/pdf/2301.13848.pdf

Also, for real-world use, ChatGPT doesn't need to beat a fine-tuned SOTA model. ChatGPT is much easier to use than having to fine-tune a more traditional model.

1