Recent comments in /f/MachineLearning

Smallpaul t1_jea4whk wrote

>You get a zero cost tutor that may or may not be correct about something objective, and as a student you are supposed to trust that?

No. I did not say to trust that.

Also: if you think that real teachers never make mistakes, you're incorrect yourself. My kids have textbooks full of errata. Even Donald Knuth issues corrections for his books (rarely).

>I also pay, well my company does, to access GPT-4 and it's still not that close to being a reliable tutor. I wouldn't tell my juniors to ask ChatGPT about issues they are having instead of asking me or another of the seniors or lead engineer.

Then you are asking them to waste time.

I am "junior" on a particular language and I wasted a bunch of time on a problem because I don't want to bug the more experience person every time I have a problem.

The situation actually happened twice in one day.

The first time, I wasted 30 minutes trying to interpret an extremely obscure error message, then asked my colleague, then kicked myself because I had run into the same problem six months ago.

Then I asked GPT-4, and it gave me six possible causes, which included the one I had seen before. Had I asked GPT-4 first, I would have saved myself 30 minutes and spared my colleague an interruption.

The second time, I asked GPT-4 directly. It gave me five possible causes, and by process of elimination I immediately knew which one it was. That saved me from trying to figure it out on my own before interrupting someone else.

You are teaching your juniors to be helpless instead of teaching them how to use tools appropriately.

> Code working is not equivocal to the code being written correctly or well. If you're the kind of engineer that just think "oh well it works at least, that's good enough" then you're the kind of engineer who will be replaced by AI tooling in the near future.

One of the ways you can use this tool is to ask it how to make the code more reliable, easier to read, etc.

If you use the tool appropriately, it can help with that too.

0

JustOneAvailableName t1_jea2dzf wrote

Software engineer perspective on attention (self quote):

> You have to think about searching. If you search, you have a query (the search term), some way to correlate the query to the actual (size unknown/indifferent) knowledge base, and the knowledge base itself. If you have to write this as a mathematical function, you need something that matches a query against each key by similarity and then returns the value corresponding to that key. The transformer equation is a pretty straightforward formula from that perspective. Each layer learns what it searches for, how it can be found, and which value it wants to transfer when requested.
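
To make the search analogy concrete, here is a minimal numpy sketch of scaled dot-product attention (just the core formula, not a full transformer layer; shapes and names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (n_queries, d), K: (n_keys, d), V: (n_keys, d_v)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # how well each query matches each key
    weights = softmax(scores, axis=-1)       # normalized relevance over the keys
    return weights @ V                       # weighted sum of the retrieved values

out = attention(np.random.randn(4, 8), np.random.randn(6, 8), np.random.randn(6, 16))
print(out.shape)  # (4, 16)
```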

RWKV changes this by removing the query, so data is no longer requested, only pushed. I am frankly surprised this seems to work so far. Pushing data (each token determining for itself how important it is to the others) does not depend on other states, which is what enables it to be an RNN.

Edit: one thing I should mention: in RWKV, importance also fades over time, so it has a recency bias.
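
Very roughly, the "push only, with decay" idea looks something like this (a heavily simplified sketch, not the actual RWKV/WKV equations):

```python
import numpy as np

def push_only_mix(keys, values, decay=0.9):
    # keys: (T,) importance each token assigns to itself (no query involved)
    # values: (T, d) information each token pushes forward
    T, d = values.shape
    out = np.zeros((T, d))
    num, den = np.zeros(d), 0.0  # running (decayed) sums of pushed values and weights
    for t in range(T):           # recurrent: each step depends only on the previous state
        w = np.exp(keys[t])                # self-determined importance
        num = decay * num + w * values[t]  # older contributions fade => recency bias
        den = decay * den + w
        out[t] = num / (den + 1e-9)
    return out

out = push_only_mix(np.random.randn(10), np.random.randn(10, 16))
```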

3

ChuckSeven t1_jea2b99 wrote

Google Translate is certainly not an LLM. LLMs can do translation, but they are significantly worse than translation models trained on translation data. Translation models use an encoder-decoder architecture, since translation is a sequence-to-sequence problem, rather than the decoder-only autoregressive architecture that LLMs use.

They are also not pretrained, afaik, since language modelling models p(x) whereas translation models p(y|x).

−6

LetGoAndBeReal t1_jea1id9 wrote

I believe you are referring to this statement from the link: "Ability to train on more examples than can fit in a prompt." Correct?

If so, as I explained, the key word here is "examples." And if you understand why, you will see that there is no contradiction. I will try to clarify why.

There are two methods that we are discussing for extending the capability of an LLM:

  1. Prompt engineering
  2. Fine-tuning

There are also different types of capability that might be extended. We are discussing the following two:

  1. Adding new knowledge/facts to the model
  2. Improving downstream processing tasks, such as classification, sentiment analysis, etc.

Both of these capabilities can readily be extended through prompt engineering. Adding new knowledge with prompt engineering involves including that knowledge as context in the prompt. Improving tasks such as classification is done by including examples of the processing you want in the prompt.

What the article says is that for the case where you want to provide examples in the prompt to make the model perform better, you can alternatively use fine-tuning. The article does not say "Ability to add more knowledge than can fit in a prompt." Examples = downstream processing tasks. Examples != new knowledge.
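
To make the distinction concrete, here is a hypothetical sketch (the prompt text and facts are invented for illustration):

```python
# 1. Adding new knowledge via prompt engineering: put the facts in the prompt as context.
knowledge_prompt = (
    "Context: Acme Corp's return window for members is 45 days.\n\n"
    "Question: How long do members have to return an item?\n"
    "Answer:"
)

# 2. Improving a downstream task: show examples of the processing you want (few-shot).
classification_prompt = (
    "Classify each review as positive or negative.\n\n"
    "Review: 'Loved it, works perfectly.' -> positive\n"
    "Review: 'Broke after two days.' -> negative\n"
    "Review: 'Exceeded my expectations.' ->"
)

# Fine-tuning can replace the *examples* in the second prompt with training data;
# it is not a reliable way to inject the *facts* used in the first one.
```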

1

DigThatData t1_jea10sg wrote

yeah... i hate to say it but I agree with the other commenters. If you have access to medical support, I strongly recommend you get seen by a physician. I'm concerned you might be experiencing some kind of psychiatric episode. If you're skeptical, that's fine; you can even tell them that.

> "Strangers on the internet expressed concern that I might be experiencing a psychiatric episode of some kind. I don't see it, but enough people suggested it that I felt it merited a professional opinion, so here I am."

6

WokeAssBaller t1_jea0o2f wrote

Again you are using an incredibly limited definition of fine tuning based on what the open ai api allows, which once again tells me you don’t know ML.

Fine-tuning is ANY additional training on a foundation model; this can be MLM training on the base model or selectively training the subsequent layers.

OF COURSE this can add knowledge, since you are doing the same kind of training that gave the model its knowledge in the first place. Glad to see you jumped on the ChatGPT bandwagon last week; build a transformer from scratch and then come talk to me.
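
For what it's worth, here is a minimal PyTorch sketch of "selectively training the subsequent layers"; the architecture is a stand-in, not any particular pretrained checkpoint:

```python
import torch
from torch import nn

# Stand-in for a pretrained model: embedding -> transformer encoder -> LM head.
model = nn.Sequential(
    nn.Embedding(30000, 256),
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
        num_layers=4,
    ),
    nn.Linear(256, 30000),
)

# Freeze everything, then unfreeze only the last encoder block and the head.
for p in model.parameters():
    p.requires_grad = False
for p in model[1].layers[-1].parameters():
    p.requires_grad = True
for p in model[2].parameters():
    p.requires_grad = True

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```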

2

LetGoAndBeReal t1_je9zfyb wrote

>And there is no reason why the same methods wouldn't work on LLMs too, for example there is already Lora for LLMs too.

It's really not helpful to make strong assertions like this without referring to specific, verifiable sources. Fine-tuning is very typically done with certain layers/parameters of the model frozen, precisely to avoid the sort of loss we are discussing. The LoRA paper itself states that LoRA "freezes the pre-trained model weights".
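
For reference, a minimal sketch of the LoRA idea (illustrative only, not the paper's exact implementation): the pretrained weight stays frozen and only a small low-rank update is trained on top of it.

```python
import torch
from torch import nn

class LoRALinear(nn.Module):
    def __init__(self, pretrained: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = pretrained
        for p in self.base.parameters():
            p.requires_grad = False                            # frozen pretrained weights
        out_f, in_f = pretrained.weight.shape
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)  # trainable
        self.B = nn.Parameter(torch.zeros(out_f, rank))        # trainable, zero-init => no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only A and B are trained
```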

0

xander76 OP t1_je9w2ax wrote

Yeah, that's definitely one of the things it offers right now. If you want a particular data shape out of GPT, we handle that, both on the side of crafting the prompt to elicit the type and on the parsing side to get the data out of the raw GPT response.

We're also building more tools to make the development process easier, which depend on the fact that imaginary functions are easy to do static analysis on. The first is an IDE plugin that lets you run and test imaginary functions directly in VS Code and compare different versions of an imaginary function to see how they do on various test inputs. We also plan to add simple annotations to the comment format so you can easily switch to other LLMs for your runtime to manage the cost/quality/privacy tradeoff.

ETA: One thing it also does right now is let you switch between models (ada, babbage, curie, davinci, gpt-3.5-turbo, gpt-4) with just a configuration change. If you use OpenAI's APIs directly, you need to change your client code, because the GPT-3 models have a different API than GPT-3.5 and GPT-4.
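
For anyone curious, the difference looks roughly like this with the (pre-1.0) openai Python package; details aside, the point is that the two model families hit different endpoints:

```python
import openai  # pre-1.0 SDK

# GPT-3 models (ada/babbage/curie/davinci) use the completions endpoint:
resp = openai.Completion.create(
    model="text-davinci-003",
    prompt="Summarize: ...",
    max_tokens=100,
)
text = resp["choices"][0]["text"]

# gpt-3.5-turbo and gpt-4 use the chat completions endpoint instead:
resp = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Summarize: ..."}],
)
text = resp["choices"][0]["message"]["content"]
```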

2

Goldenier t1_je9uruu wrote

This is false; actually, most of the time the opposite is the problem: the model learns too much from the new data it's fine-tuned on (overfitting on it) but forgets the "knowledge" in the original model. The simplest and most common example right now is using DreamBooth, LoRA, or other fine-tuning methods on parts of the big image diffusion models: if you overtrain, the model will place the newly trained face or object in almost all of its outputs, so it easily learns new data but also easily forgets the old. (One mitigation is to use a preservation loss to make sure it also keeps the old knowledge.) And there is no reason the same methods wouldn't work on LLMs too; for example, there is already LoRA for LLMs.
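
A rough sketch of the preservation-loss idea (generic and simplified, not DreamBooth's exact prior-preservation loss): penalize the fine-tuned model for drifting away from a frozen copy of the original model on "prior" data while it learns the new data.

```python
import torch
import torch.nn.functional as F

def finetune_step(model, frozen_original, new_x, new_y, prior_x, lambda_prior=1.0):
    # Loss on the new concept we want the model to learn.
    loss_new = F.mse_loss(model(new_x), new_y)

    # Preservation loss: stay close to what the original (frozen) model
    # would have produced on old "prior" data, so old knowledge is kept.
    with torch.no_grad():
        prior_target = frozen_original(prior_x)
    loss_prior = F.mse_loss(model(prior_x), prior_target)

    return loss_new + lambda_prior * loss_prior
```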

2