cameldrv t1_jdldgo8 wrote

I think that fine-tuning has its place, but I don't think you're going to be able to replicate the results of a 175B parameter model with a 6B one, simply because the larger model empirically just holds so much more information.

If you think about it from an information theory standpoint, all of that specific knowledge has to be encoded in the weights. With 8-bit weights, a 6B parameter model is only 6 GB. Even with incredible data compression, I don't think you can fit anywhere near the amount of human knowledge that's in GPT-3.5 into that amount of space.
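Just to make the arithmetic explicit, here's a rough sketch (the helper function is mine, and the 175B figure is the commonly cited GPT-3 parameter count, not an official GPT-3.5 number):

```python
# Back-of-the-envelope storage for model weights alone (ignores activations, KV cache, etc.)

def weight_storage_gb(num_params: float, bits_per_weight: int) -> float:
    """Raw size of the weights in gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9

print(weight_storage_gb(6e9, 8))    # 6.0  -> a 6B model at 8-bit is ~6 GB
print(weight_storage_gb(175e9, 8))  # 175.0 -> a 175B model at 8-bit is ~175 GB
```

So even before compression tricks, the smaller model has roughly 30x less room to store whatever the bigger one has memorized.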
