Recent comments in /f/MachineLearning

ZestyData t1_jeagwa8 wrote

Thank you for repeating half of what I said back to me; much like ChatGPT, you catch on quickly to new information:

So, let's be clear here then. Contrary to your incorrect first comment, Google Translate is an LLM, it is autoregressive, and it is pretrained, at least by the definition of pre-training given in the GPT paper. That paper was the parallel I first used in my own comment for OP, who came into this thread with knowledge of the latest GPT-3+ and ChatGPT products.

>It's funny how you mention unrelated stuff, like RLHF

I did so because I had naively assumed you were also a newcomer to the field who knew nothing outside of ChatGPT, given how severely wrong your first comment was. I'll grant you that it wasn't related, except as a way to extend an olive branch and a reasonable exit plan if that were the case for you. Alas.

>LLMs tend to be >>1B parameter models

Again, no. ELMo was 94 million, GPT was 120 million, GPT-2 was 1.5 billion. BERT has ~300 million parameters. These are all Large Language Models and have been called so for years. There is no hard definition of what constitutes "large". 2018's large is nearly today's consumer-hardware level. Google Translate (and Google's search) are among the most widely used LLMs actually out there.

Man. Why do you keep talking about things that you don't understand, even when corrected?

>Lastly, modelling p(y|x) is significantly easier and thus less general than modelling p(x).

Sure! It is easier! But that's not what you said. You'd initially brought up P(Y|X) as a justification that translation isn't pre-trained. Those are two unrelated concepts. The ultimate modelling goal of translation is P(Y|X), but both GPT (Generative Pre-Training) and Google Translate pretrain their ability to predict P(X|context) in the decoder, just like any hot new LLM of today, hence my correction for you. The application towards the ultimate P(Y|X) is not connected to the pretraining of their decoders.

1

Barton5877 t1_jeadc3o wrote

On 2:

Competence is used sociologically to describe the ability to perform, such as to speak or act, in a manner demonstrating some level of mastery, but it isn't necessarily a sign of understanding.

I'd be loath to have to design a metric or assessment by which to "measure" understanding. One can measure or rate competence; the degree to which the person "understands" what they are doing, why, how, for what purpose and so on is another matter.

In linguistics, there's also a distinction between practical and discursive reason that can be applied here: ability to reason vs ability to describe the reasoning. Again, understanding escapes measurement, insofar as what we do and how we know what we are doing isn't the same as describing it (which requires both reflection on our actions and translation into speech that communicates them accurately).

The long and short of it being that "understanding" is never going to be the right term for us to use.

That said, there should be terminology for describing the conceptual connectedness that LLMs display. Some of this is in the models and design. Some of it is in our projection and psychological interpretation of their communication and actions.

I don't know to what degree LLMs have "latent" conceptual connectedness, or whether this is presented only in the response to prompts.

3

pier4r t1_jead39m wrote

As a semi-layman, while I was amazed by the progress in ML, I was skeptical of ever-increasing models that need more and more parameters to do well. I felt like "more parameters can improve things, then other factors follow".

I asked myself whether there was any effort to be more efficient by shrinking things, and recently I read about LLaMA and realized that this direction is now being pursued as well.

1

ChuckSeven t1_jeab590 wrote

It's funny how you mention unrelated stuff, like RLHF, which has nothing to do with the point of discussion. A bit like an LLM I reckon.

See, Google Translate models are (as far as publicly known) trained on a parallel corpus. This is supervised data since it provides the same text in different languages. The model is trained to model, e.g., p(y=German|x=English). There is much less supervised data available, which means that the models you train will be significantly smaller. Note that translation models are usually only auto-regressive in the decoding part. The encoder part, which usually makes up about 50% of the parameters, is not auto-regressive.
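To make the encoder/decoder distinction concrete, here is a minimal sketch of which positions each token is allowed to look at. It assumes transformer-style attention masks, which is an assumption on my part rather than a description of any particular production system, and the function names are made up for illustration.

```python
# Toy illustration of visibility patterns; all names here are hypothetical.

def encoder_mask(n):
    """Non-autoregressive: every source position sees every position."""
    return [[True] * n for _ in range(n)]

def decoder_mask(n):
    """Autoregressive (causal): target position i sees only positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def show(mask):
    """Render a mask as rows of 'x' (visible) and '.' (hidden)."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)

print(show(encoder_mask(3)))
print()
print(show(decoder_mask(3)))
```

The encoder grid is fully filled in, while the decoder grid is lower-triangular, so each output token depends only on the tokens generated before it.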

LLMs tend to be >>1B parameter models trained on billions or trillions of tokens. The vast amount of data is believed to be necessary to train such large models. The models are modelling p(x), which in some cases is monolingual or virtually so. An LLM that is trained on a vast but English-only corpus will not be capable of translating at all. LLMs trained on a multilingual corpus can be prompted to translate, but they are far inferior to actual translation models.

Lastly, modelling p(y|x) is significantly easier and thus less general than modelling p(x).
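Written out under the usual left-to-right factorization, the two objectives look like this. The sequence lengths $T$ and $T'$ and the subscript notation are my own assumptions, not something taken from a specific paper in this thread.

```latex
p(x) = \prod_{t=1}^{T} p(x_t \mid x_{<t})
\qquad
p(y \mid x) = \prod_{t=1}^{T'} p(y_t \mid y_{<t},\, x)
```

In the first case the model has to account for the whole text on its own; in the second, the source sentence $x$ is always given and only the target $y$ is modelled.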

−4

cc-test t1_jea9ejf wrote

>Then you are asking them to waste time.

Having inexperienced staff gain more knowledge about languages and tooling in the context of the codebases they work in isn't a waste of time.

Sure, for example, I'm not going to explain every function in each library or package that we use, and will point juniors towards the documentation. Equally, I'm not going to say "hey ask ChatGPT instead of just looking at the docs", mainly because ChatGPT's knowledge is out of date and the junior would likely be getting outdated information.

>The first time, I wasted 30 minutes trying to interpret an extremely obscure error message, then asked my colleague, then kicked myself because I had run into the same problem six months ago.

So you weren't learning a new language or codebase, you were working with something you already knew. I don't care if anyone, regardless of seniority, uses GPT or any other LLM or any type of model for that matter to solve problems with. You were able to filter through the incorrect outputs or less than ideal outputs and arrive at the solution that suited the problem best.

How are you supposed to do that when you have no foundation to work with?

I do care about people new to a subject matter using it to learn because of the false positives the likes of ChatGPT can spew out.

Telling a junior to use ChatGPT to learn something new is just lazy mentoring and I'd take that as a red flag for any other senior or lead I found doing that.

1

ZestyData t1_jea73i3 wrote

LLM simply means Large Language Model: a language model with a large number of parameters. The term has referred to all sorts of deep learning architectures over the past 20 years.

Google invented the Transformer architecture and, most importantly, discovered how well transformers scale in power as they scale in size. This invention kickstarted the new arms race, in which "LLM" came to refer to transformer models with large numbers of parameters.

Google Translate's current production architecture is a (large) transformer to encode and an RNN to decode.[1] This falls into the category of LLMs, which weren't just invented when OpenAI invented RLHF at the end of 2022 and released ChatGPT. GPT is similar, but uses a transformer decoder only, with no separate encoder.

The decoding RNN in Google Translate absolutely is an autoregressive model.

I re-read the original GPT paper[2] to try and get a better understanding of the actual "pre-training" term here, and I genuinely can't see a difference between that and what Google writes about in its papers & blogs [3]; each just defines X & Y differently, but they're both predicting a token based on the context window. GPT calls it pretraining because it does an additional step after learning P(X | context). But both approaches perform this fundamental autoregressive training.
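The shared objective can be sketched roughly like this. It is a toy illustration, not either paper's method: the `BIGRAM` table and `next_token_nll` are hypothetical names, the probabilities are made up, and the "context window" is cut down to just the previous token.

```python
import math

# Hypothetical toy "model": next-token probabilities given the previous token.
# A real model would condition on the whole context window.
BIGRAM = {
    "<s>": {"the": 0.5, "a": 0.5},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"</s>": 1.0},
}

def next_token_nll(tokens):
    """Sum of -log P(token_t | previous token) over the sequence."""
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += -math.log(BIGRAM[prev][cur])
    return total

nll = next_token_nll(["<s>", "the", "cat", "</s>"])
print(round(nll, 4))  # prints 1.3863, i.e. 2 * ln(2)
```

Training pushes this number down over a corpus. Whether the tokens being predicted are a continuation of the same text or a translation of a source sentence changes what the context contains, not the shape of the loss.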

[1] - https://ai.googleblog.com/2020/06/recent-advances-in-google-translate.html

[2] - https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf

[3] - https://arxiv.org/pdf/1609.08144.pdf

4