Submitted by Technologenesis t3_125wzvw in singularity
UseNew5079 t1_je6lgyb wrote
Maybe a 7B model can reach GPT-4-level performance if trained for a _very_ long time. Facebook's paper showed that performance kept improving right up to the end of training, with no visible plateau. Maybe it's just very inefficient but possible? Or maybe there is another way.
Akimbo333 t1_je9proo wrote
Why does performance increase with more training rather than with more parameters?
UseNew5079 t1_je9wrw6 wrote
Check the LLaMA paper: https://arxiv.org/pdf/2302.13971.pdf
Specifically this graph: https://paste.pics/6f817f0aa71065e155027d313d70f18c
Performance improves (loss drops) with either more parameters or more training. More parameters just allow a faster and deeper initial drop in loss, but the later part of the curve looks the same across model sizes. At least that's my interpretation.
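For intuition, here's a rough sketch of the kind of scaling law behind curves like that, using a Chinchilla-style fit L(N, D) = E + A/N^alpha + B/D^beta. The coefficients are roughly the ones reported by Hoffmann et al. (2022), not the LLaMA paper's own fit, so treat the numbers as illustrative only:

```python
# Sketch of a Chinchilla-style scaling law: loss as a function of
# parameter count N and training tokens D.
# Coefficients are approximately those reported by Hoffmann et al. (2022);
# they are used here purely for illustration, not as LLaMA's fitted values.

def loss(n_params: float, n_tokens: float,
         E: float = 1.69, A: float = 406.4, B: float = 410.7,
         alpha: float = 0.34, beta: float = 0.28) -> float:
    """Approximate training loss L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Compare a 7B and a 65B model at increasing token counts.
for tokens in (0.3e12, 1e12, 3e12, 10e12):
    l7 = loss(7e9, tokens)
    l65 = loss(65e9, tokens)
    print(f"{tokens / 1e12:>4.1f}T tokens: 7B loss ~ {l7:.3f}, 65B loss ~ {l65:.3f}")
```

Running this shows the same qualitative pattern as the graph: the bigger model sits at a lower loss for any given token budget, but both keep improving as tokens grow, and the gap narrows only slowly.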