Recent comments in /f/deeplearning

BrotherAmazing t1_j52hucj wrote

If OP asked this question in a court of law, the attorney would immediately yell “OBJECTION!” and the judge would sustain it, scold OP, but give them a chance to ask a question that doesn’t automatically presuppose that pre-training cannot be “correct” or that there is always a “better” way than pre-training.

FWIW, I often avoid transfer learning or pre-training when it’s not needed, but I’m sure I could construct a non-pathological problem of practical importance where pre-training is “optimal” in some sense of that word.

2

Blacky372 t1_j521lej wrote

I like your article, thank you for sharing.

But writing "no spam, no nonsense" is a little weird to me if I get this when trying to subscribe.

Don't get me wrong, it's fine to monetize your content and to use your followers' data to show them personalized ads. But acting like you're just enthusiastic about sharing info at the same time doesn't really fit.

3

JohnFatherJohn t1_j5161hj wrote

People will be disappointed because they don't understand the relationship between model complexity and performance. There are so many irresponsible and/or uneducated articles suggesting that an orders-of-magnitude increase in the number of parameters will translate to orders-of-magnitude performance gains, which is obviously wrong.

6

sEi_ t1_j50efga wrote

About GPT-4, straight from the horse's mouth:

Interview with Sam Altman (CEO of OpenAI) from two days ago (17 Jan).

Article in The Verge:

>"OpenAI CEO Sam Altman on GPT-4: ‘people are begging to be disappointed and they will be’"

https://www.theverge.com/23560328/openai-gpt-4-rumor-release-date-sam-altman-interview

Video with the interview in 2 parts:

>StrictlyVC in conversation with Sam Altman

https://www.youtube.com/watch?v=57OU18cogJI&ab_channel=ConnieLoizos

6

LesleyFair OP t1_j501xt6 wrote

First off, thanks a lot for reading, and thank you for the good questions:

A1) The current GPT-3 has 175B parameters. If GPT-4 were 100T parameters, that would be a scale-up of roughly 500x.

A2) I got the calculation from the paper on the Turing-NLG model. The total training time in seconds is obtained by multiplying the number of training tokens by the number of model parameters, and then dividing by the number of GPUs times each GPU's FLOPs per second.
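For concreteness, here's that back-of-envelope arithmetic as a quick Python sketch. The token count and hardware throughput below are illustrative assumptions on my part, not the article's exact inputs, so the result comes out lower than 53 years:

```python
# Back-of-envelope estimate using the Turing-NLG-style formula above.
# All inputs are illustrative assumptions, not the article's exact numbers.

params = 100e12           # hypothetical GPT-4 size: 100T parameters
scaleup = params / 175e9  # vs. GPT-3's 175B -> ~571x, i.e. "roughly 500x"

tokens = 300e9            # assumed training tokens (about GPT-3's budget)
n_gpus = 1024             # assumed cluster size
flops_per_gpu = 120e12    # assumed sustained throughput (~120 TFLOP/s per GPU)

seconds = tokens * params / (n_gpus * flops_per_gpu)
print(f"scale-up: ~{scaleup:.0f}x, training time: ~{seconds / 3.154e7:.1f} years")
```

The estimate scales inversely with cluster size and per-GPU throughput, so fewer GPUs or a larger token budget quickly pushes it from years into decades.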

8

--dany-- t1_j4zx6lf wrote

Very good write-up! Thanks for sharing your thoughts and observations. A couple of questions that many other folks may have as well:

  1. How do you arrive at the number that it’s 500x smaller, or 200 million parameters?
  2. Your estimate of 53 years for training a 100T model: can you elaborate on how you got that number?

10

shironorey OP t1_j4zqttn wrote

I've done a project on plant-leaf classification on Android, and surprisingly it works fine without straining even an old phone (using MobileNetV3). I've also heard quite frequently about computation problems with deep learning on Android, though. But it's a nice idea and could actually be an alternative for my problem. Will definitely look into it further.
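For reference, the on-device part is less scary than it sounds. Here's a minimal sketch of one common route, converting a Keras MobileNetV3 to TensorFlow Lite for Android (not my exact pipeline, just the general idea):

```python
import tensorflow as tf

# Load a MobileNetV3 classifier (a stand-in for your own trained model).
model = tf.keras.applications.MobileNetV3Small(weights="imagenet")

# Convert to TensorFlow Lite with post-training quantization for old phones.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# Ship this file with the Android app and run it via the TFLite interpreter.
with open("mobilenet_v3.tflite", "wb") as f:
    f.write(tflite_model)
```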

1

junetwentyfirst2020 t1_j4wrzut wrote

The way I like to think about this is that the algorithm has to model many things. If you’re trying to learn whether an image contains a dog or not, you first have to model natural imagery, correlations between features, and maybe even a little 2D-to-3D to simplify invariances. I’m speaking hypothetically here, because what the model actually learns is latent and hard to inspect.

If you train from scratch, you have to learn all of these tasks from a dataset that is likely much smaller than what’s required to learn them without overfitting. If you use a pretrained model, most of that work is already done, and the model only has to learn one additional thing from the same amount of data.
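Here’s a minimal PyTorch sketch of that “learn just one more thing” setup, using the hypothetical dog/not-dog task from above; the backbone and hyperparameters are placeholders, not a recommendation:

```python
import torch
import torch.nn as nn
from torchvision import models

# Reuse pretrained features; only a new head is learned for dog vs. not-dog.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

for p in model.parameters():  # freeze the pretrained feature extractor
    p.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, 2)  # new head: dog / not-dog

# Only the head's parameters are trained on your (small) dataset.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```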

1

ruphan t1_j4winzi wrote

It is definitely possible. Let me give an analogy first. In the context of education, let's assume our pretrained model is a person with multiple STEM degrees in fields like neuroscience and math, and let the model trained from scratch be someone with no degree yet. We have a limited amount of resources, say a couple of textbooks on deep learning. It's intuitive that the first person should not only pick up deep learning faster but also end up better at it than the latter, given their stronger grasp of the fundamentals and greater experience.

To extend this analogy to your case, I believe the pretrained model must be quite big relative to the limited amount of new data you have. The pretrained model will have developed a better set of filters than could be learned from a relatively small dataset by a big model trained from scratch. Just as in the analogy, it doesn't matter that neuroscience and math are not exactly deep learning: having strong fundamentals from pretraining on millions of images is what lets that model achieve better accuracy.

If you have a bigger fine-tuning dataset, this gap in accuracy should eventually diminish.
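If you want to test that directly, here's a rough sketch of the experiment: fine-tune a pretrained model and an identical from-scratch model on growing subsets of your data and watch the gap. The dataset below is a stand-in and the hyperparameters are placeholders:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, models, transforms

tfm = transforms.Compose([transforms.Resize(224), transforms.ToTensor()])
data = datasets.FakeData(size=2000, num_classes=2, transform=tfm)  # swap in your real dataset

def train_briefly(model, subset, epochs=1):
    """Minimal fine-tuning loop; all hyperparameters are illustrative."""
    loader = DataLoader(subset, batch_size=32, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

for n in (100, 500, 2000):  # growing fine-tuning set sizes
    for weights in (models.ResNet18_Weights.DEFAULT, None):  # pretrained vs. scratch
        model = models.resnet18(weights=weights)
        model.fc = nn.Linear(model.fc.in_features, 2)
        train_briefly(model, Subset(data, range(n)))
        # ...evaluate both on a held-out set; the gap should shrink as n grows.
```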

2