Recent comments in /f/MachineLearning

matus_pikuliak OP t1_je0am88 wrote

Only some papers used few-shot prompting; it was usually beneficial, and sometimes it helped beat the SOTA.

Yeah, OpenAI definitely does not care about these benchmarks, but I think they are still useful for seeing how capable the models are. I find it hard to imagine the models being used in some applications if they cannot reliably do even the simple tasks these benchmarks evaluate.

1

TehDing t1_je08lpg wrote

You can ask GPT to spell a word, or provide the words as individual "S P A C E D" characters, and it will do similarly poorly - it has nothing to do with tokenization. GPT is capable of spelling; it can even identify that it is not playing well if you ask whether something is a good guess, but it continues to give poor answers.

In terms of 'solving' a game like this 20-questions example, there are only 12,000 valid words to guess from, or at worst 26^5 possible answers, which still makes the search space smaller than (or at worst on par with) the blog experiment's.

Want an easier game? It sucks at Hangman too. It'll guess by letter frequency, but not well enough to piece a word together. Even guessing on the basis of common n-grams would probably be a good enough strategy.
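To be concrete about how low that bar is, a toy frequency-based guesser is only a few lines. This is just a sketch; the candidate word list and the pattern format are assumptions on my part:

```python
# Toy letter-frequency Hangman guesser - the kind of simple baseline
# strategy described above. `words` is any dictionary list you supply.
from collections import Counter

def filter_candidates(words, pattern, guessed):
    """Keep words consistent with a revealed pattern like '_a__e'.
    `guessed` is the set of letters tried so far."""
    return [
        w for w in words
        if len(w) == len(pattern)
        and all(
            (c == p) if p != "_" else (c not in guessed)
            for p, c in zip(pattern, w)
        )
    ]

def next_guess(candidates, guessed):
    """Guess the unguessed letter that appears in the most candidates."""
    counts = Counter(
        letter
        for word in candidates
        for letter in set(word)  # count each letter once per word
        if letter not in guessed
    )
    return counts.most_common(1)[0][0] if counts else None
```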

My experience is that LLMs are poor at novel reasoning. This makes sense: RLHF isn't giving these things a consciousness. Maybe with tweaks/tools we'll actually see some "thinking", but for now (this may change next week at the rate things are going) they're not very good at games in general as a result (another example: I haven't tried it with GPT4, but GPT3 cheats at chess).

1

Impallion t1_je07in1 wrote

I completely agree, and of the things that u/nxqv listed, I think impact is the one most everyday people want and fear they will no longer have, more so than fame, riches, clout, etc. It's totally natural to want the things you spend effort on to have impact.

Now what I'm more interested in is the question of how much impact is enough to make you feel satisfied, and I think this is where the FOMO starts to set in. People want to have a "large" impact: making company-wide differences, influencing large swaths of people. I think the fear is that in the face of a ChatGPT, your little model or little application can only reach a handful of others.

Extrapolate current trends and you might think: oh well, AI applications are just going to get bigger and bigger. Midjourney 5 or SuperChatGPT-12 will be so insanely capable that we will have no more use for human writing, human art, human music, human programming. There will simply be no more room for my work to EVER have a big impact in the future. (Maybe this change is also similar to how the scientific greats back in the day could discover big theories like Einstein's relativity, but nowadays you need to hyper-specialize in academia to produce results for your tiny corner.)

My solution is that we need to dig a little deeper. What does it mean to be human? What does it mean to live a good, meaningful life? If your answer is that a life worth living is one where your impact reaches thousands or millions of humans, then yes, we might be shifting away from that possibility. But humans are built for connection, and I think we will need to look inwards and realize that we don't need to influence thousands to experience it. You can make a little model or application that affects hundreds. You can write a song just for your friends and family. You can paint a piece of art that just hangs on your wall and gets a single compliment. To me that is already human connection, and it is just as meaningful as building a large model that drives the next Google/Meta forward.

2

machineko t1_je05orp wrote

I agree. While these giant centralized models are all over the news, there are ways to make smaller models much more efficient (e.g. the LoRA approach mentioned above). And in the process of working with these techniques, we can perhaps discover new methods and architectures.

We are working on an open-source project focused on making LLM fine-tuning simple, fast, and efficient: https://github.com/stochasticai/xturing.
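For anyone who wants to see what the LoRA idea looks like in code, here's a minimal sketch using the Hugging Face peft library (not our xturing API; the base model and hyperparameters are just illustrative):

```python
# Minimal LoRA fine-tuning setup via peft: only small low-rank adapter
# matrices are trained, not the full model. Names here are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "facebook/opt-350m"  # stand-in small base model
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

config = LoraConfig(
    r=8,                                  # rank of the update matrices
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% trainable
```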

OP, we still have a ton of stuff we want to try out to make fine-tuning faster and more compute/memory efficient, if you are interested in contributing.

6

truchisoft t1_je05j63 wrote

The funny thing about these posts is that this is clearly propaganda aimed at low-effort people.

Anyone caring about this is either blinded by their own prejudice or just too dumb to even try GPT once themselves.

Everyone else does not need someone telling them that even GPT3.5 is incredible for coding (and a lot of other stuff). It is not perfect, but it goes a long way; heck, I was even able to make a simple game in less than 3 hours using 99% GPT3.5 code and DALL-E sprites.

−10

WokeAssBaller t1_je04bbu wrote

Reply to comment by lambertb in [D] GPT4 and coding problems by enryu42

I'm an MLE and I've used it a bunch; it's hardly ever actually useful. It gets close, but it's not there, and it's faster to Google almost every time.

It will probably be useful in a year or two, but it needs to understand how to run its own experiments. Anyone who actually thinks it is useful right now is just buying hype.

1

lambertb t1_je02zgn wrote

Have you used the tools yourself? I have, and a 40% increase in productivity is totally plausible, and often an underestimate considering I can now do things I would not have even tried previously. I encourage you to try them, with healthy skepticism and an open mind.

1

eamonious t1_je02go2 wrote

I work with a database that draws on experimental data on response times in human trials from Wash U in St. Louis. Alternatively, you can use the standardized grade-level vocab lists that exist in a number of states. Frequency data is also associated with these.

Obviously there's no true silver bullet for defining it, but I think we all have some intuition about, and recognize a degree of objectivity in, what a reasonably correct ordering of a random selection of words would look like, based on our understanding of language (including two-word terms like "ice cream", idiomatic phrases like "rain cats and dogs", or borrowed expressions like "deja vu" and "savoir faire"). Which in my mind means GPT should also be able to achieve that intuition. I encourage people to try this with GPT; in my experience it doesn't perform well, at least by any human-intuition standard.
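As a crude illustration of the frequency-based side of this, here's a sketch; the wordfreq package and the sample words are stand-ins I'm using, not the actual database:

```python
# Crude sketch: order words from "easy" to "hard" by corpus frequency.
from wordfreq import zipf_frequency

words = ["cat", "ice cream", "ephemeral", "savoir faire"]

# Higher Zipf frequency ~ more common ~ intuitively easier.
ordered = sorted(words, key=lambda w: zipf_frequency(w, "en"), reverse=True)
print(ordered)
```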

What's interesting to me is the possibility that the model ranks the "difficulty of words" as it itself experiences them: words that are, for whatever reason, more "difficult" for the model itself to assess.

Sorry, I’ll try to report back with something more concrete.

1

muskoxnotverydirty t1_je027xh wrote

Yeah it's speculation. I agree.

> There is no evidence that it was tested on training data, at this point.

I think what the author is trying to say is that for some of these tests there's no evidence it was tested on training data, but there's also no evidence that it wasn't, and the ability to generalize in the specific domain of the tests hinges on that difference. If nothing else, it would be nice for those who publish test results to state how much they know about whether test data was in the training data. It seems to me that they could automate a search within the training set to see if the exact wording is used.
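Even a toy exact-match check would be a start. This is only a sketch of the idea; the function names and the n-gram length are my assumptions, not anything the publishers have described:

```python
# Flag test examples whose long character n-grams appear verbatim
# in the training corpus - a crude exact-wording contamination check.
def char_ngrams(text: str, n: int = 50):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def flag_contaminated(test_examples, training_corpus: str, n: int = 50):
    flagged = []
    for example in test_examples:
        if any(gram in training_corpus for gram in char_ngrams(example, n)):
            flagged.append(example)
    return flagged
```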

11

nxqv t1_je00ks4 wrote

Also, don't lose sight of the forest because of a tree. We're talking about impact in the context of FOMO - if you feel that level of anxiety and rush about potentially missing out on the ability to make an impact because others are already making the impact you want to make, it's more likely to be ego-driven than genuine altruism

1

light24bulbs t1_jdzzeh4 wrote

Yes, I'm into it now. Code like this can be adapted to load bulk data instead of Q&A pairs.

I suspect some of the training parameters need to be adjusted a bit to prevent overfitting, and obviously the data loading and templating need to be removed.

https://github.com/lxe/llama-tune

Or, for a cooler approach where you make a LoRA layer: https://github.com/serp-ai/LLaMA-8bit-LoRA
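Roughly the kind of change I mean, sketched with a Hugging Face datasets pipeline (the checkpoint, file name, and max length are illustrative):

```python
# Load bulk raw text instead of Q&A pairs - no prompt templating,
# just tokenize the text into training examples.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # stand-in

raw = load_dataset("text", data_files={"train": "corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train = raw["train"].map(tokenize, batched=True, remove_columns=["text"])
```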

1

rshah4 t1_jdzyo8u wrote

Nice work! How were the results when comparing ChatGPT zero-shot versus few-shot? I have noticed that you can often get an improvement from few-shot learning (giving the model a few examples in the prompt).
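For anyone unfamiliar, the difference is purely in the prompt; a toy illustration (the task and examples are made up):

```python
# Zero-shot: the model gets only the task description.
zero_shot = "Classify the sentiment of this review: 'The service was slow.'\nSentiment:"

# Few-shot: the same task, preceded by a few worked examples.
few_shot = (
    "Classify the sentiment of each review.\n"
    "Review: 'Loved every minute.' Sentiment: positive\n"
    "Review: 'Total waste of money.' Sentiment: negative\n"
    "Review: 'The service was slow.' Sentiment:"
)
```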

I am not surprised that we don't see much of an improvement over GPT-3 on traditional NLP tasks. It seems much of OpenAI's focus is not on these benchmarks but on making the results more useful to people (all the instruction-tuning / RLHF work).

https://arxiv.org/pdf/2209.12356.pdf
https://arxiv.org/pdf/2301.13848.pdf

Also, for real-world use, ChatGPT doesn't need to beat a fine-tuned SOTA model. ChatGPT is much easier to use than having to fine-tune a more traditional model.

1