Submitted by Vegetable-Skill-9700 t3_121agx4 in deeplearning
BellyDancerUrgot t1_jdldmda wrote
Reply to comment by FirstOrderCat in Do we really need 100B+ parameters in a large language model? by Vegetable-Skill-9700
Funny cuz, I keep seeing people rave like madmen over gpt4 and chatgpt and I’ve had a 50-50 hit rate wrt good results or hallucinated bullshit with both of them. Like it isn’t even funny. People think it’s going to replace programmers and doctors, meanwhile it can’t do basic shit like cite the correct paper.
Of course it aces tests and leetcode problems it was trained on. It was trained on basically the entire internet. How do you even get an unbiased estimate of test error?
Doesn’t mean it isn’t impressive. It’s just one huge block of really good associative memory. Doesn’t even begin to approach the footholds of AGI imo. No world model. No intuition. Just memory.
suflaj t1_jdlruqe wrote
Could you share those questions it supposedly hallucinated on? I have not seen it hallucinate EVER on new chats, only when the hallucination was based on that chat's history.
> Of course it aces tests and leetcode problems it was trained on.
It does not ace leetcode. This statement casts doubt on your ability to evaluate it objectively.
> How do you even get an unbiased estimate of test error?
First you need to define unbiased. If unbiased means no direct dataset leak, then the existing evaluation is already done like that.
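One way to make "no direct dataset leak" concrete is a verbatim-overlap check between each test item and the training corpus. The sketch below is only an illustration: the function names, the 20-character window, and the sample strings are all assumptions, not a description of how any published evaluation was actually run.

```python
def char_ngrams(text: str, n: int) -> set[str]:
    """Return every length-n character substring of the whitespace-normalized text."""
    norm = " ".join(text.lower().split())
    return {norm[i:i + n] for i in range(len(norm) - n + 1)}


def looks_leaked(test_item: str, corpus_docs: list[str], n: int = 20) -> bool:
    """Flag a test item if any of its n-grams appears verbatim in a corpus document."""
    grams = char_ngrams(test_item, n)
    for doc in corpus_docs:
        norm_doc = " ".join(doc.lower().split())
        if any(g in norm_doc for g in grams):
            return True
    return False


# Hypothetical corpus and test items, made up for illustration.
corpus = ["Given an array of integers, return indices of the two numbers that add up to a target."]
seen = "Given an array of integers, return indices of the two numbers that add up to a target."
unseen = "Count the islands in a grid after each of k flood events."
leak_flags = [looks_leaked(seen, corpus), looks_leaked(unseen, corpus)]
```

A check like this only catches verbatim leaks. Paraphrased or translated copies of a problem would pass it, which is part of why the definition of "unbiased" matters.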
> Doesn’t even begin to approach the footholds of AGI imo.
Seems like you're getting caught on the AI effect. We do not know if associative memory is insufficient to reach AGI.
> No world model. No intuition.
Similarly, we do not know if those are necessary for AGI. Furthermore, I would dare you to define intuition, because depending on your answer, DL models inherently have that.
BellyDancerUrgot t1_jdns4yg wrote
- Paper summarization and factual analysis of 3d generative models, basic math, basic OOP understanding were the broad topics I tested it on. Not giving u the exact prompts but you are free to evaluate it yourself.
- Wrong choice of words on my part. When I said ‘ace’ I implied that It does really good on leetcode questions from before 2021 and it’s abysmal after. Also the ones it does solve it solves at a really fast rate. From a test that happened a few weeks ago it solved 3 questions pretty much instantly and that itself would have placed it in the top 10% of competitors.
- Unbiased implies being tested on truly unseen data which there is far less of considering the size of the training data used. Many of the examples cited in their new paper “sparks of agi” are not even reproducible.
https://twitter.com/katecrawford/status/1638524011876433921?s=46&t=kwpwSgfnJvGe6J-1CEe_5Q
- Insufficient because as I said , no world model, no intuition, only memory. Which is why it hallucinates.
- Intuition is understanding the structure of the world without having to have the entire internet to memorize it. A good analogy would be of how a child isn’t taught how gravity works when they first start walking. Or how you can not have knowledge about a subject and still infer based on your understanding of underlying concepts.
These are things u can inherently not test or quantify when evaluating models like gpt that have been trained on everything and you still don’t know what it has been trained on lol.
- You can keep daring me and idc because I have these debates with fellow researchers in the field, always looking for a good debate if I have time. I’m not even an NLP researcher and even then I know the existential dread creeping in on NLP researchers because of how esoteric these results are and how AI influencers have blown things out of proportion citing cherry picked results that aren’t even reproducible because you don’t know how to reproduce them.
- There is no real way an unbiased scientist reads openAIs new paper on sparks of AGI and goes , “oh look gpt4 is solving AGI”.
- Going back on what I said earlier, yes there is always the possibility that I’m wrong and GPT is indeed the stepping stone to AGI but we don’t know because the only results u have access to are not very convincing. And on a user level it has failed to impress me beyond being a really good chatbot which can do some creative work.
suflaj t1_jdnvq8q wrote
> Not giving u the exact prompts
Then we will not be able to verify your claims. I hope you don't expect others (especially those with a different experience, challenging your claims) to carry your burden of proof.
> When I said ‘ace’ I implied that It does really good on leetcode questions from before 2021 and it’s abysmal after.
I have not experienced this. Could you provide the set of problems you claim this is the case for?
> Also the ones it does solve it solves at a really fast rate.
Given its architecture, I do not believe this is actually the case. Its inference is only reliant on the output length, not the problem difficulty.
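A rough illustration of that point, as a toy sketch rather than a real model: in autoregressive decoding each new token costs one forward pass, so for a fixed prompt length the amount of work follows the number of generated tokens. Here `toy_next_token` is a hypothetical stand-in for the model, and the prompts and token counts are made up.

```python
import random


def toy_next_token(context: list[int], vocab_size: int, rng: random.Random) -> int:
    """Hypothetical stand-in for one forward pass of a language model."""
    return rng.randrange(vocab_size)


def generate(prompt: list[int], max_new_tokens: int, eos: int, seed: int = 0) -> tuple[list[int], int]:
    """Decode one token at a time and count how many forward passes were needed."""
    rng = random.Random(seed)
    tokens = list(prompt)
    passes = 0
    for _ in range(max_new_tokens):
        nxt = toy_next_token(tokens, vocab_size=50, rng=rng)
        passes += 1
        tokens.append(nxt)
        if nxt == eos:  # stop early if the end-of-sequence token is produced
            break
    return tokens, passes


# Two made-up prompts standing in for an "easy" and a "hard" problem.
easy_prompt = [1, 2, 3]
hard_prompt = [7, 8, 9]
# eos=-1 is never produced by the toy model, so both runs emit exactly 40 tokens.
_, easy_passes = generate(easy_prompt, max_new_tokens=40, eos=-1)
_, hard_passes = generate(hard_prompt, max_new_tokens=40, eos=-1)
```

Both calls take 40 passes because both produce 40 tokens. In practice a harder problem may simply elicit a longer answer, and prompt length adds its own cost, which this sketch ignores.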
> From a test that happened a few weeks ago it solved 3 questions pretty much instantly and that itself would have placed it in the top 10% of competitors.
That does not seem to fit my definition of acing it. Acing is being able to solve all or most questions. Given a specific year, that is not equal to being able to solve 3 problems. Also, refer to the paragraph above about why inference speed is meaningless.
Given that it is generally unknown what it was trained on, I don't think it's even adequate to judge its performance on long-known programming problems.
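If the training cutoff were known, the before/after comparison discussed in this thread could be run as a simple date split. The following is a minimal sketch under that assumption: the cutoff date, the record layout, and every record in it are invented for illustration and are not measurements of any real model.

```python
from datetime import date


def pass_rate_by_cutoff(results: list[tuple[date, bool]], cutoff: date) -> dict[str, float]:
    """Split (published, solved) records at the cutoff and return the pass rate on each side."""
    before = [ok for published, ok in results if published < cutoff]
    after = [ok for published, ok in results if published >= cutoff]

    def rate(outcomes: list[bool]) -> float:
        return sum(outcomes) / len(outcomes) if outcomes else float("nan")

    return {"before": rate(before), "after": rate(after)}


# Made-up records for illustration only.
records = [
    (date(2020, 5, 1), True),
    (date(2020, 11, 3), True),
    (date(2021, 2, 9), False),
    (date(2022, 6, 14), False),
    (date(2022, 9, 30), True),
    (date(2023, 1, 20), False),
]
# Hypothetical cutoff; the real one is exactly what is unknown here.
rates = pass_rate_by_cutoff(records, cutoff=date(2021, 1, 1))
```

A large gap between the two rates would be consistent with memorization, but it would not prove it, since newer problems may also differ in difficulty.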
> Insufficient because as I said , no world model, no intuition, only memory. Which is why it hallucinates.
You should first cite some authority on why it would be important. We generally do not even know what it would take to prevent hallucination, since we humans, who have that knowledge, often hallucinate as well.
> Intuition is understanding the structure of the world without having to have the entire internet to memorize it.
So why would that be important? Also, the word you're looking for is generalizing, not intuition. Intuition has nothing to do with knowledge, it is at most loosely tied to wisdom.
I also fail to understand why such a thing would be relevant here. First, no entity we know of (other than God) would possess this property. Secondly, if you're implying that GPT-like models have to memorize something to know it, you are deluding yourself - GPT-like models memorize relations, they are not memory networks.
> A good analogy would be of how a child isn’t taught how gravity works when they first start walking.
This is orthogonal to your definition. A child does not understand gravity. No entity we know of understands gravity, we at most understand its effects to some extent. So it's not a good analogy.
> Or how you can not have knowledge about a subject and still infer based on your understanding of underlying concepts.
This is also orthogonal to your definition. Firstly it is fallacious in the sense that we cannot even know what is objective truth (and so it requires a very liberal definition of "knowledge"), and secondly you do not account for correct inference by chance (which does not require understanding). Intuition, by a general definition, has little to do with (conscious) understanding.
> These are things u can inherently not test or quantify when evaluating models like gpt that have been trained on everything and you still don’t know what it has been trained on lol.
First you should prove that these are relevant or wanted properties for whatever it is you are describing. In terms of AGI, it's still unknown what would be required to achieve it. Certainly it is not obvious how intuition, however you define it, is relevant for it.
> I’m not even an NLP researcher and even then I know the existential dread creeping in on NLP researchers because of how esoteric these results are and how AI influencers have blown things out of proportion citing cherry picked results that aren’t even reproducible because you don’t know how to reproduce them.
Brother, you just did an ad hominem on yourself. These statements only suggest you are not qualified to talk about this. I have no need to personally attack you to talk with you (not debate), so I would prefer if you did not trivialize your standpoint. For the time being, I am not interested in the validity of it - first I'm trying to understand what exactly you are claiming, as you have not provided a way for me to reproduce and check your claims (which are contradictory to my experience).
> There is no real way an unbiased scientist reads openAIs new paper on sparks of AGI and goes , “oh look gpt4 is solving AGI”.
Nobody is even claiming that. It is you who mentioned AGI first. I can tell you that NLP researchers generally do not use the term as much as you think. It currently isn't well defined, so it is largely meaningless.
> Going back on what I said earlier, yes there is always the possibilit
The things worth considering you said are easy to check - you can just provide the logs (you have the history saved) and since GPT4 is as reproducible as ChatGPT, we can confirm or discard your claims. There is no need for uncertainty (unless you will it).
BellyDancerUrgot t1_jdp945d wrote
Claim, since you managed to get lost in your own comment:
Gpt hallucinates a lot and is unreliable for any factual work. It’s useful for creative work when the authenticity of its output doesn’t have to be checked.
Your wall of text can be summarized as, “I’m gonna debate you by suggesting no one knows the definition of AGI.” The living embodiment of the saying “empty vessels make much noise.” No one knows what the definition of intuition is but what we know is that memory does not play a part in it. Understanding causality does.
It’s actually hilarious that you bring up source citation as some form of trump card after I mention how everything you know about GPT4 is something someone has told you to believe in without any real discernible and reproducible evidence.
Instead of maybe asking me to spoon feed you spend a whole of 20 secs googling.
https://twitter.com/random_walker/status/1638525616424099841?s=46&t=kwpwSgfnJvGe6J-1CEe_5Q
https://twitter.com/chhillee/status/1635790330854526981?s=46&t=kwpwSgfnJvGe6J-1CEe_5Q
https://aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks
https://aiguide.substack.com/p/did-chatgpt-really-pass-graduate
“I don’t quite get how it works” + “it surprises me” ≠ it could maybe be sentient if I squint.
Thank you for taking the time to write two paragraphs pointing out my error in using the phrase “aces leetcode” after I acknowledged and corrected the mistake myself, maybe you have some word quota you were trying to fulfill with that. Inference time being dependent on length of output sequence has been a constant since the first attention paper let alone the first transformer paper. My point is, it’s good at solving leetcode when it’s present in the training set.
Ps- also kindly refrain from passing remarks on my understanding of the subject when the only arguments you can make are refuting others without intellectual dissent. It’s quite easy to say, “no I don’t believe u prove it” while also not being able to distinguish between Q K and V if it hit u on the face.
suflaj t1_jdqh5se wrote
> Gpt hallucinates a lot and is unreliable for any factual work.
No, I understand that's what you're saying, however, this is not a claim that you can even check. You have demonstrated already that your definitions are not aligned with generally accepted ones (particularly for intuition), so without concrete examples this statement is hard to take seriously.
> Your wall of text can be summarized as, “I’m gonna debate you by suggesting no one knows the definition of AGI.”
I'm sad that's what you got from my response. The point was to challenge your claims about whether GPT4 is or isn't AGI based on the mere fact you're judging that over properties which might be irrelevant for the definition. It is sad that you are personally attacking me instead of addressing my concerns.
> No one knows what the definition of intuition is
That is not correct. Here are some definitions of intuition:
- an ability to understand or know something immediately based on your feelings rather than fact (Cambridge)
- the power or faculty of attaining to direct knowledge or cognition without evident rational thought and inference (Merriam-Webster)
- a natural ability or power that makes it possible to know something without any proof or evidence : a feeling that guides a person to act a certain way without fully understanding why (Britannica)
You might notice that all these 3 definitions are satisfied by DL models in general.
> but what we know is that memory does not play a part in it.
This is also not true: https://journals.sagepub.com/doi/full/10.1177/1555343416686476
The question is - why are you making stuff up despite the counterevidence being 1 Google search away?
> It’s actually hilarious that you bring up source citation as some form of trump card after I mention how everything you know about GPT4 is something someone has told you to believe in without any real discernible and reproducible evidence.
I bring it up as you have not provided any other basis for your claims. You refuse to provide the logs for your claims to be checked. Your claims are contrary to my experience, and it seems others' experience as well. You claim things contrary to contemporary science. I do not want to discard your claims outright, I do not want to personally attack you despite being given ample opportunity to do so, I'm asking you to give me something we can discuss and not turn it into "you're wrong because I have a different experience".
> Instead of maybe asking me to spoon feed you spend a whole of 20 secs googling.
I'm not asking you to spoon feed me, I'm asking you to carry your own burden of proof. It's really shameful for a self-proclaimed person in academia to be offended by someone asking them for elaboration.
Now, could you explain what those links mean? The first one, for example, does not help your cause. Not only does it not concern GPT4, but rather Bard, a model significantly less performant than even ChatGPT, it also claims that the model is not actually hallucinating, but not understanding sarcasm.
The second link also doesn't help your cause - rather than examining the generalization potential of a model, it suggests the issue is with the data. It also does not evaluate the newer problems as a whole, but a subset.
The 3rd and 4th links also do not help your cause. First, they do not claim what you are claiming. Second, they list concerns (and I applaud them for at least elaborating a lot more than you), but they do not really test them. Rather than claims, they present hypotheses.
> “I don’t quite get how it works” + “it surprises me” ≠ it could maybe be sentient if I squint.
Yeah. Also note: "I don't quite get how it works" + "It doesn't satisfy my arbitrary criteria on generalization" ≠ It doesn't generalize
> after I acknowledged and corrected the mistake myself
I corrected your correction. It would be great if you could recognize that evaluating the performance on a small subset of problems is not equal to evaluating whether the model aces anything.
> maybe you have some word quota you were trying to fulfill with that
Not at all. I just want to be very clear, given that I am criticising your (in)ability to clearly present arguments; doing otherwise would be hypocritical.
> My point is, it’s good at solving leetcode when it’s present in the training set.
Of course it is. However, your actual claim was this:
> Also the ones it does solve it solves at a really fast rate.
Your claim suggested that the speed at which it solves it is somehow relevant to the problems it solves correctly. This is demonstrably false, and that is what I corrected you on.
> Ps- also kindly refrain from passing remarks on my understanding of the subject when the only arguments you can make are refuting others without intellectual dissent.
I am not passing these remarks. You yourself claim you are not all that familiar with the topic. Some of your claims have cast doubt not only on your competence in the matter, but now even on the truthfulness of your experiences. For example, I am beginning to doubt whether you have even used GPT4 given your reluctance to provide your logs.
The argument I am making is that I don't have the same experience. And that's not only me... Note, however, that I am not confidently saying that I am right or you are wrong - I am, first and foremost, asking you to provide us with the logs so we can check your claims, which for now are contrary to the general public's opinion. Then we can discuss what actually happened.
> It’s quite easy to say, “no I don’t believe u prove it” while also not being able to distinguish between Q K and V if it hit u on the face.
It's also quite easy to copy paste the logs that could save us from what has now turned into a debate (and might soon lead to a block if personal attacks continue), yet here we are.
So I ask you again - can you provide us with the logs that you experienced hallucination with?
EDIT since he (u/BellyDancerUrgot) downvoted and blocked me
> Empty vessels make much noise seems to be a quote u live by. I’ll let the readers of this thread determine who between us has contributed to the discussion and who writes extensively verbose commentary , ironically , with 0 content.
I think whoever reads this is going to be sad. Ultimately, I think you should make sure as few people see this as possible; this kind of approach brings shame not only to your academic career, but also to you as a person. You are young, so you will learn not to be overly enthusiastic in time, though.
BellyDancerUrgot t1_jds7yao wrote
Empty vessels make much noise seems to be a quote u live by. I’ll let the readers of this thread determine who between us has contributed to the discussion and who writes extensively verbose commentary , ironically , with 0 content.
whispering-wisp t1_jdmfbpk wrote
One of the researchers found a little while ago that you could get gpt to hallucinate that it opened urls and was reading or summarizing content. Some of it was RNG.
I believe at least for the urls , it was fixed and it is more consistent about telling you it doesn't have a live feed.
suflaj t1_jdmr9t8 wrote
You can, but not on a fresh chat or without an unpatched jailbreak method. Also, I think you are referring to ChatGPT, which is very different from GPT4. Most researchers haven't even had the time to check out GPT4 given it's behind a paywall and limited to 25 requests per 3 hours.
whispering-wisp t1_jdo8kjq wrote
I use "researcher" loosely but I think you are correct. They were pointing out gpt 4 doesn't have the problem.
StrippedSilicon t1_jdmhdac wrote
You are wrong, it does well on problems completely outside of its training data. There's a good look here: https://arxiv.org/abs/2303.12712
It's obviously not just memorizing, it has some kind of "understanding" to be able to do this.
BellyDancerUrgot t1_jdno8w6 wrote
That paper is laughable and a meme. My twitter feed has been spammed by people tweeting about this paper and as someone in academia it’s sad to see the quality of research publications be this low. I can’t believe I’m saying this as a student of Deep Learning but Gary Marcus on his latest blogpost is actually right.
StrippedSilicon t1_jdnukc7 wrote
People who point to this paper to claim sentience or AGI or whatever are obviously wrong, it's nothing of the sort. Still, saying that it's just memorizing is also very silly, given it can answer questions that aren't in the training data, or even particularly close to anything in the training data.
BellyDancerUrgot t1_jdpa0mz wrote
Tbf I think I went a bit too far when I said it has everything memorized. But it also has access to an internet worth of contextual information on basically everything that has ever existed. So even though it’s wrong to say it’s 100% memorized, it’s still just intelligently regurgitating information it has learnt with new context. Being able to re-contextualize information isn’t a small feat mind u. I think gpt is amazing just like I found the original diffusion paper and wgans to be. It’s Just really overhyped to be something it isn’t and fails quite spectacularly on logical and factual queries. Cites things that don’t exist, makes simple mistakes but solves more complex ones. Tell tale sign of the model lacking a fundamental understanding of the subject.
StrippedSilicon t1_jdrldvz wrote
"Recontextualizing information" is not an unfair description, but I'm not sure that it really explains things like the example in 4.4 where it answers a math Olympiad question that there's no way was in the training set (assuming that they're being honest about the training set). I don't know how a model can arrive at the answer it does without some kind of deeper understanding than just putting existing information together in a different order. Maybe the most correct thing is simply to admit we don't really know what's going on, since 100 billion parameters, or however big gpt-4 is, is beyond a simple interpretation.
"Open"AI's recent turn to secrecy isn't helping things either.
BellyDancerUrgot t1_jds7iva wrote
The reason I say it’s a recontextualization and lacks deeper understanding is because it doesn’t hallucinate sometimes , it hallucinates all the time, sometimes the hallucinations align with reality that’s all. Take this thread for eg:
- 
https://twitter.com/ylecun/status/1639685628722806786?s=48&t=kwpwSgfnJvGe6J-1CEe_5Q 
- 
https://twitter.com/stanislavfort/status/1639731204307005443?s=48&t=kwpwSgfnJvGe6J-1CEe_5Q 
- 
https://twitter.com/phillipharr1s/status/1640029380670881793?s=48&t=kwpwSgfnJvGe6J-1CEe_5Q 
A system that fully understood the underlying structure of the question would not give you varying answers with the same prompt.
Inconclusive is the third likeliest answer. Despite having a big bias toward the correct answer (keywords like dubious for eg) it still makes mistakes to a rather simple question. Sometimes it does get it right with the bias sometimes even without the bias.
Language imo lacks causality for intelligence since it’s a mere byproduct of intelligence. Which is why these models imo hallucinate all the time, and sometimes the hallucinations line up with reality and sometimes they don’t. The likelihood of the prior is just increased because of the huge train size.
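One factor worth separating out when the same prompt gives varying answers: chat models usually sample from a probability distribution over outputs rather than always taking the most likely one, so some variation is expected from the decoding step alone. The sketch below assumes a made-up three-way answer distribution (with "inconclusive" as the third likeliest option) and hypothetical helper names; it does not reflect the actual probabilities of any model.

```python
import random

# Hypothetical answer distribution for one fixed prompt; the numbers are made up.
answer_probs = {"correct": 0.55, "wrong": 0.30, "inconclusive": 0.15}


def greedy(probs: dict[str, float]) -> str:
    """Always pick the single most likely answer."""
    return max(probs, key=probs.get)


def sample(probs: dict[str, float], rng: random.Random) -> str:
    """Draw one answer in proportion to its probability."""
    answers = list(probs)
    return rng.choices(answers, weights=[probs[a] for a in answers], k=1)[0]


rng = random.Random(0)
# Collect the distinct answers seen over 200 repeated runs of the same "prompt".
greedy_runs = {greedy(answer_probs) for _ in range(200)}
sampled_runs = {sample(answer_probs, rng) for _ in range(200)}
```

Greedy decoding returns the same answer every time, while sampling returns a mix. That alone does not settle whether the model understands the question, only that run-to-run variation by itself is weak evidence either way.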
StrippedSilicon t1_jdt7h5o wrote
So... how does it solve a complicated math problem it hasn't seen before exactly with only regurgitating information?
BellyDancerUrgot t1_jdtci38 wrote
Well let me ask you, how does it fail simple problems if it can solve more complex ones? If you solve these problems analytically then it stands to reason that you wouldn’t be making an error , ever, for a simple question as that.
StrippedSilicon t1_jdte8lj wrote
That's why I'm appealing to the "we don't actually understand what it's doing" case. Certainly the AGI-like intelligence explanation falls apart in a lot of cases, but the explanation of only spitting out the training data in a different order or context doesn't work either.
FirstOrderCat t1_jdldywl wrote
I think people are amazed by the speed of progress. OpenAI got 10B in funding, they built a strong team, and they can continue expanding the system with the missing components.
BellyDancerUrgot t1_jdlfsuq wrote
Except now it’s closedAI and most of the papers they release are laughably esoteric. I know the world will catch up within months to whatever they pioneer but it’s just funny seeing this happen after they held a sanctimonious attitude for so long.
FirstOrderCat t1_jdlrsxv wrote
The first and/or most powerful AGI will likely be closed and owned by a corporation.
nixed9 t1_jdmpbks wrote
I don’t think this is accurate. Are you sure you were using GPT-4? It’s leaps and bounds better than text-davinci-003 which was chatGPT3.5
BellyDancerUrgot t1_jdnnuii wrote
Iirc bing chat uses gpt4 ?
nixed9 t1_jdnpdma wrote
In my personal experience, Bing Chat, while it says it's powered by GPT-4, is way, way, way less powerful and useful than ChatGPT-4 (which is only available for Pro users right now). I've found ChatGPT-4 SIGNIFICANTLY better.
It also has emergent properties of intelligence, vision, and mapping, somehow. We don't know how.
This paper, which was done on GPT-4, and a more powerful version than what we have access to via either Bing or OpenAI.com, is astounding: https://arxiv.org/pdf/2303.12712.pdf
BellyDancerUrgot t1_jdpb9pi wrote
I agree that Bing chat is not nearly as good as chatgpt4 and I already know everyone is going to cite that paper as a counter to my argument but that paper isn’t reproducible, idek if it’s peer reviewed, it’s lacking a lot of details and has a lot of conjecture. It’s bad literature. Hence even tho the claims are hype, I take it with a bucket full of salt. A lot of scientists I follow in this field have mentioned that even tho the progress is noticeable in terms of managing misinformation, it’s just an incremental improvement and nothing truly groundbreaking.
Not saying OpenAI is 100% lying. But this thread https://twitter.com/katecrawford/status/1638524011876433921?s=46&t=kwpwSgfnJvGe6J-1CEe_5Q by Kate Crawford (msft research ) is a good example of what researchers actually think of claims like these and some of its dangers.
Until I use it for myself personally I won’t know and will have to rely on what I’ve heard from other phds and masters or PostDocs or professors. Personally, The only thing I can compare to is chatgpt and bing chat and both have been far less than stellar in my experience.
Appropriate_Ant_4629 t1_jdnliik wrote
> chatgpt and I’ve had a 50-50 hit rate wrt good results or hallucinated bullshit with both of them
Which just suggests they're not large enough yet to memorize/encode enough of the types of content you're interested in.
BellyDancerUrgot t1_jdpbtyo wrote
Oh I’m sure it had the data. I tested them on a few different things: OOP, some basic CNN math, some philosophy, some literature reviews, some paper summarization. The last two were really bad. One mistake in CNN math. One mistake in OOP. Creative things like writing essays or solving technical troubleshooting problems, even niche stuff like how I could shunt a gpu, it managed to answer correctly.
I think people have the idea that I think gpt is shit. On the contrary I think it’s amazing. Just not the holy angel and elixir of life that AI influencers peddle it as.
ChingChong--PingPong t1_jdwfooc wrote
It was not trained on basically the entire internet. Not even close. Even if they trained it on all the pages Google has indexed, that's not even close to the entire internet, and I'm not even talking about the dark web. Toss in all the data behind user accounts, paywalls, intranets. Then toss on all the audio and video on all the social media and audio/video platforms and OpenAI couldn't afford to train, much less optimize, much less host a model of that size.
BellyDancerUrgot t1_jdx6w01 wrote
The implication was, most of accessible textual data. Which is true. The exaggeration was such cuz it’s a language model first and foremost and previous iterations like gpt3 and 3.5 were not multimodal. Also , as far as accounts go, that’s a huge ‘?’ atm. Especially going by tweets like these
https://twitter.com/katecrawford/status/1638524011876433921?s=46&t=kwpwSgfnJvGe6J-1CEe_5Q
The reality is , we and you don’t have the slightest clue regarding what it was trained on and msft has sufficient compute to train on all of the text data on the internet.
When it comes to multimodal media we don’t really need to train a model on the same amount of data required for text.